
Training and testing a classification model with scikit-learn

1. Training and testing a classification model with scikit-learn

In this video, we'll use the features we have extracted to train and test a supervised classification model.

2. Naive Bayes classifier

A Naive Bayes model is commonly used for NLP classification problems because of its basis in probability. The Naive Bayes algorithm attempts to answer the question: given a particular piece of data, how likely is a particular outcome? For example, thinking back to our movie genres dataset: if the plot mentions a spaceship, how likely is it that the movie is Sci-Fi? And given a spaceship and an alien, how likely is it NOW a Sci-Fi movie? Each word from our CountVectorizer acts as a feature, helping classify our text using probability. Naive Bayes has been used for text classification problems since the 1960s and continues to be used today despite the growth of many other models, algorithms, and neural network architectures. That said, it is not always the best tool for the job, but it is a simple and effective one that you will use to build a fake news classifier.
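To make that probability question concrete, here is a minimal sketch using made-up counts (not the course data): it applies Bayes' theorem by hand to estimate how likely a plot mentioning "spaceship" is to belong to a Sci-Fi movie. All numbers and variable names are hypothetical and only illustrate the reasoning.

```python
# Hypothetical counts, purely to illustrate the Bayes reasoning described above
n_scifi, n_action = 300, 700                         # movies per genre
spaceship_in_scifi, spaceship_in_action = 150, 14    # plots mentioning "spaceship"

# Prior probabilities P(genre)
p_scifi = n_scifi / (n_scifi + n_action)
p_action = n_action / (n_scifi + n_action)

# Likelihoods P("spaceship" | genre)
p_word_given_scifi = spaceship_in_scifi / n_scifi
p_word_given_action = spaceship_in_action / n_action

# Bayes' theorem: P(Sci-Fi | "spaceship")
numerator = p_word_given_scifi * p_scifi
denominator = numerator + p_word_given_action * p_action
print(numerator / denominator)   # roughly 0.91 with these made-up counts
```

MultinomialNB does essentially this for every word in the vocabulary at once, multiplying the per-word likelihoods together under the "naive" assumption that words are independent given the genre.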

3. Naive Bayes with scikit-learn

We'll use scikit-learn's Naive Bayes implementation to take a look at our Sci-Fi versus Action plot classification problem. Recall that the data we're using is simply IMDb plot summaries and whether each movie is science fiction or action. First, we import the Naive Bayes model class, Multinomial Naive Bayes, which works well with CountVectorizers because it expects integer inputs. MultinomialNB also handles multi-class classification, where there are more than two labels. This model may not work as well with floats, such as TF-IDF weighted inputs; for those, consider support vector machines or linear models instead, although I recommend trying Naive Bayes first to see whether it also works well. We use the metrics module to evaluate model performance. We initialize our class and call fit with the training data. If you recall from the previous video, this determines the model's internal parameters based on the dataset. We pass the training count vectors first and the training labels second. After fitting the model, we call predict with the count vectors of the test data. Predict uses the trained model to predict labels from the test vectors. We save the predicted labels in the variable pred to test accuracy. Finally, we measure accuracy using accuracy_score from the metrics module, passing the test labels and the predicted labels. Accuracy for our model means the percentage of correct genre guesses out of total guesses. Our model has about 86% accuracy, which is pretty good for a first try! You'll be applying the Multinomial Naive Bayes classifier to the fake news dataset in the following exercises.
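As a runnable sketch of the workflow described above, the code below builds count vectors, fits MultinomialNB, predicts, and scores accuracy. The tiny plots list, the genre labels, and names such as count_train and nb_classifier are stand-ins, not the course's IMDb dataset, so the printed accuracy will not match the 86% from the video.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical stand-in for the IMDb plot summaries and genre labels
plots = [
    "A spaceship crew battles an alien on a distant planet",
    "A detective chases a gang of bank robbers through the city",
    "Robots from the future hunt a rebel scientist",
    "An ex-soldier fights mercenaries to rescue hostages",
]
genres = ["Sci-Fi", "Action", "Sci-Fi", "Action"]

# Split plots and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    plots, genres, test_size=0.5, random_state=42, stratify=genres
)

# Build bag-of-words count vectors, as in the previous video
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Fit the classifier on the training vectors and labels, then predict
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)

# Accuracy: share of correct genre predictions on the test set
print(metrics.accuracy_score(y_test, pred))
```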

4. Confusion matrix

To further evaluate our model, we can also check the confusion matrix, which shows correct and incorrect labels. The confusion_matrix function from the metrics module takes the test labels, the predictions, and a list of labels. If the label list is not passed, scikit-learn orders the labels in sorted order. The confusion matrix is a bit easier to read when we transform it into a table. The first and last values of the matrix (its main diagonal) show the correct classifications of both Action and Sci-Fi films based on the plot bag-of-words vectors. In a confusion matrix, the predicted labels are shown across the top and the true labels down the side. This confusion matrix shows 864 Sci-Fi movies incorrectly labeled as Action and 563 Action movies incorrectly labeled as Sci-Fi. We can see from the distribution of true positives and negatives that our dataset is a bit skewed: we have many more Action films than Sci-Fi. This could be one reason that our Action movies are predicted more accurately.
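Here is a small self-contained sketch of the confusion_matrix call; the y_test and pred lists are toy values rather than the course results, so the cell counts will not match the numbers discussed above.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted genre labels (not the course data)
y_test = ["Action", "Action", "Sci-Fi", "Sci-Fi", "Action"]
pred = ["Action", "Sci-Fi", "Sci-Fi", "Action", "Action"]

# Rows follow the true labels and columns the predicted labels,
# in the order given by the labels argument; without it, scikit-learn
# sorts the label values.
print(confusion_matrix(y_test, pred, labels=["Action", "Sci-Fi"]))
```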

5. Let's practice!

Now it's your turn to train and test a Naive Bayes model for the fake news problem!