Classification modeling example
You have previously prepared a set of Russian tweets for classification. Of the 20,000 tweets, you have filtered to tweets with an account_type of Left or Right, and selected the first 2000 tweets of each. You have already tokenized the tweets into words, removed stop words, and performed stemming. Furthermore, you converted word counts into a document-term matrix with TFIDF values for weights and saved this matrix as left_right_matrix_small.
You will use this matrix to predict whether a tweet was generated by a left-leaning tweet bot or a right-leaning tweet bot. The labels can be found in the vector left_right_labels.
This exercise is part of the course Introduction to Natural Language Processing in R.
Exercise instructions
- Set the random seed to 1111 for reproducibility.
- Create training and test datasets. Use a 75% sample for the training data.
- Run a random forest model on the training data, using left_right_labels for the response vector.
- Print the random forest results.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
library(randomForest)
# Create train/test split
set.___(___)
sample_size <- floor(___ * nrow(left_right_matrix_small))
train_ind <- ___(nrow(left_right_matrix_small), size = ___)
train <- left_right_matrix_small[___, ]
test <- left_right_matrix_small[-___, ]
# Create a random forest classifier
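# Note: the argument is spelled ntree (lowercase); nTree is not matched and the default number of trees is used
rfc <- randomForest(x = as.data.frame(as.matrix(___)),
                    y = ___[___],
                    ntree = 50)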
# Print the results
___
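
The workflow described in the instructions can be sketched end to end on simulated data. This is a minimal illustration, not the course solution: toy_matrix and toy_labels are hypothetical stand-ins for left_right_matrix_small and left_right_labels, and the final scoring step on the held-out rows is an addition that the exercise itself does not ask for.

```r
library(randomForest)

# Simulated stand-in for a TF-IDF document-term matrix (hypothetical data)
set.seed(1111)
n_docs <- 200
n_terms <- 10
toy_matrix <- matrix(runif(n_docs * n_terms), nrow = n_docs,
                     dimnames = list(NULL, paste0("term", seq_len(n_terms))))

# Labels depend on the first term so the forest has some signal to find
toy_labels <- factor(ifelse(toy_matrix[, "term1"] > 0.5, "Left", "Right"))

# 75% of the rows go to training; the remaining rows are held out
sample_size <- floor(0.75 * nrow(toy_matrix))
train_ind <- sample(nrow(toy_matrix), size = sample_size)
train <- toy_matrix[train_ind, ]
test <- toy_matrix[-train_ind, ]

# The response must be a factor for classification
rfc <- randomForest(x = as.data.frame(train),
                    y = toy_labels[train_ind],
                    ntree = 50)
print(rfc)

# Score the held-out rows and compare against their true labels
test_pred <- predict(rfc, newdata = as.data.frame(test))
test_accuracy <- mean(test_pred == toy_labels[-train_ind])
```

The printed summary reports the out-of-bag error estimate and a confusion matrix for the training rows, while test_accuracy gives a separate check on rows the forest never saw. A sparse document-term matrix would first need as.matrix(), as in the template above.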