1. Classification modeling
Now that we know how to prepare data for modeling,
let's start classifying text.
2. Recap of the steps
In the previous lesson, we cleaned and prepared data on sentences from the book Animal Farm that contained either the horse Boxer or the pig Napoleon.
We created tokens of the words used and turned them into a document-term matrix with TF-IDF weighting.
Let's start working on the last three steps of classification modeling by learning how to split the dataset into training and testing datasets.
3. Step 2: split the data
There are many ways to do this in R, but we will use the sample function from base R.
Let's set a random seed so that we can reproduce our results; this ensures we get the same train and test samples each time. Here, we randomly select 80% of the 150 sentences to use in training our model and leave the remaining 20% of the rows for testing.
Lastly, we split the matrix into the two datasets using the train indices.
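In code, this step might look something like the sketch below, where sentence_tfidf and labels are assumed names for the TF-IDF matrix and the response vector, not the exact objects from the lesson.

    # Set a seed so the random split is reproducible (the seed value is arbitrary)
    set.seed(1234)

    # Randomly pick 80% of the 150 row indices for training
    train_index <- sample(1:150, size = 0.8 * 150)

    # sentence_tfidf: matrix of TF-IDF weights, one row per sentence
    # labels: "boxer" or "napoleon" for each sentence
    train_x <- sentence_tfidf[train_index, ]
    train_y <- labels[train_index]
    test_x  <- sentence_tfidf[-train_index, ]
    test_y  <- labels[-train_index]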
4. Random forest models
For simplicity, we will use random forest models to do our classification.
If you want more details on how random forest models work or their parameters, or have any questions about these models, please see the Machine Learning with Tree-Based Models in R course.
5. Classification example
Using the randomForest implementation from the randomForest package, we only need to provide the training data, x, and the response values, y.
Let's take a minute to think through what we are using for x and y. x is a data.frame of the document-term matrix, holding the TF-IDF weight of each word in each sentence. y is a binary classification: either "boxer", for sentences that originally contained the name of the horse Boxer, or "napoleon", for sentences that originally contained the name of the pig Napoleon.
After the model finishes training, we get the confusion matrix for the training dataset, suggesting an accuracy of almost 80%.
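A minimal sketch of that call, reusing the assumed train_x and train_y objects from the split sketch:

    library(randomForest)

    # Train a random forest classifier on the TF-IDF features;
    # passing y as a factor tells randomForest to do classification
    rfc <- randomForest(x = as.data.frame(train_x), y = as.factor(train_y))

    # Printing the model shows the out-of-bag error estimate
    # and the confusion matrix for the training data
    rfc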
6. The confusion matrix
While accuracy, AUC curves, and other metrics used to select the best-performing model are out of the scope of this course, we will at least cover how to read a confusion matrix.
From our previous model, we have the following output. The rows of the matrix represent the actual labels, while the columns represent the predicted values.
There were 37 sentences for Boxer that we correctly predicted to be for Boxer, and 20 sentences for Boxer that we incorrectly predicted to be for Napoleon. Similarly, we accurately predicted 55 out of the 63 total sentences for Napoleon.
If we add the diagonal entries together and divide by the total number of training sentences, we can find the overall accuracy of our model on the training set. Here we have 76% accuracy.
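As a small worked example of that calculation, we can rebuild the counts above as a matrix in R and divide the diagonal by the total:

    # Training confusion matrix: rows are actual labels, columns are predictions
    conf <- matrix(c(37, 20,
                      8, 55),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("boxer", "napoleon"),
                                   predicted = c("boxer", "napoleon")))

    # Accuracy: correct predictions (the diagonal) over all predictions
    sum(diag(conf)) / sum(conf)  # (37 + 55) / 120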
7. Test set predictions
In order to report the test accuracy, we need to make predictions on the test dataset, and compare them to the actual values.
We can use the predict function with our classification model, rfc, to predict on the new data. We also supply predict with a data frame of the TF-IDF weights for the test dataset.
We create a confusion matrix for the results using table and review its output.
In this example we accurately predicted 14 out of 18 sentences for Boxer, and 10 out of 12 for Napoleon.
This is an accuracy of 80%.
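In code, these test-set steps might look like the following sketch, again with the assumed rfc, test_x, and test_y objects from the earlier sketches.

    # Predict the class of each test sentence with the trained model,
    # supplying the TF-IDF weights for the test set as a data frame
    test_pred <- predict(rfc, newdata = as.data.frame(test_x))

    # Confusion matrix: rows are actual labels, columns are predictions
    table(actual = test_y, predicted = test_pred)

    # Overall test accuracy
    mean(test_pred == test_y)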
8. Classification practice
Let's work through a few examples for classifying text.