Donald John Trump, 45th president of the United States of America, is a prolific tweeter with a following of more than 20 million. Could we use a simple machine learning algorithm to identify whether a given tweet is from Donald Trump? Yes! Naive Bayes is a simple yet effective ML algorithm that has been used extensively to classify spam emails.
In this case study, we use a dataset comprising 200 tweets from Donald Trump and others to train and test our Naive Bayes classifier. The dataset is read from a CSV file into allTweets (our pandas DataFrame). See the allTweets data structure below.
The dataset is randomly split 80/20: 80% is allocated for training and 20% for testing our tweet classifier. The simple Naive Bayes algorithm worked very well, achieving an accuracy of 87.5% on the test tweets! See the confusion matrix below.
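To make the split and the accuracy figure concrete, here is a minimal standard-library sketch. With 200 tweets, an 80/20 split leaves 40 for testing, so 87.5% would be consistent with 35 of the 40 being classified correctly. The helper names (split_80_20, confusion_matrix, accuracy) and the eight toy labels are assumptions made up for illustration; they are not the case study's own code or results.

```python
import random
from collections import Counter


def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) in an 80/20 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic for the 80% boundary
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of cases where the predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


# 200 placeholder row indices stand in for the 200 tweets.
train, test = split_80_20(range(200))
print(len(train), len(test))

# Hypothetical toy labels (not the real test set): 7 of 8 are correct.
actual = ["trump", "trump", "trump", "trump", "other", "other", "other", "other"]
predicted = ["trump", "trump", "trump", "other", "other", "other", "other", "other"]
matrix = confusion_matrix(actual, predicted)
print(matrix[("trump", "other")], accuracy(matrix))
```

The off-diagonal cells of the matrix (actual and predicted labels differ) are the misclassified tweets; accuracy is simply the diagonal total divided by the number of test tweets.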
We have built a simple utility function, identifyTweet(), that returns the prediction for a given tweet. See some sample runs of the utility function below.
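For readers curious about what happens inside such a utility, the following is a rough sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. It is not the identifyTweet() function from this case study: identify_tweet, train_naive_bayes, tokenize, and the toy training strings are all hypothetical, and the strings are invented placeholders rather than real tweets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labelled_tweets):
    """Collect per-label tweet counts, per-label word counts, and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_tweets:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def identify_tweet(text, model):
    """Return the label with the highest log-posterior score for the text."""
    label_counts, word_counts, vocabulary = model
    total_tweets = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total_tweets)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy strings for illustration only; these are not real tweets.
TOY_TWEETS = [
    ("make america great again", "trump"),
    ("great rally tonight thank you", "trump"),
    ("we will win big", "trump"),
    ("new blog post about python", "other"),
    ("coffee and code this morning", "other"),
    ("reading a great book on statistics", "other"),
]

model = train_naive_bayes(TOY_TWEETS)
print(identify_tweet("make america win again", model))
print(identify_tweet("python code and coffee", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the add-one smoothing keeps a word that appears under only one label from forcing the other label's probability to zero.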
The Python code is shown below: