Classification modeling example
You have previously prepared a set of Russian tweets for classification. From the 20,000 tweets, you kept those with an account_type of Left or Right and selected the first 2000 tweets of each type. You have already tokenized the tweets into words, removed stop words, and performed stemming. You then converted the word counts into a document-term matrix weighted by TF-IDF values and saved it as left_right_matrix_small.
You will use this matrix to predict whether a tweet was generated by a left-leaning or a right-leaning tweet bot. The labels can be found in the vector left_right_labels.
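To make the TF-IDF weighting mentioned above concrete, here is a minimal base R sketch on a tiny invented count matrix. It assumes one common definition (term frequency normalized by document length, multiplied by the natural log of the inverse document frequency); the course matrix may have been built with a package that uses a slightly different variant. The object names counts, tf, idf and tfidf are hypothetical.

```r
# Toy count matrix: 3 documents (rows) x 4 terms (columns); values are invented.
counts <- matrix(c(2, 0, 1, 0,
                   0, 3, 1, 0,
                   1, 1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("vote", "tax", "news", "rally")))

# Term frequency: each count divided by the total count of its document (row).
tf <- counts / rowSums(counts)

# Inverse document frequency: log(number of documents / documents containing the term).
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: multiply every column of tf by the idf of its term.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

A term that occurs in every document (here "news") gets an idf of zero, so it carries no weight, while a term found in only one document (here "rally") is weighted most heavily.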
This exercise is part of the course
Introduction to Natural Language Processing in R
Exercise instructions
- Set the random seed to 1111 for reproducibility.
- Create training and test datasets. Use a 75% sample for the training data.
- Run a random forest model on the training data, using left_right_labels for the response vector.
- Print the random forest results.
Interactive exercise
Try this exercise by completing this sample code.
library(randomForest)
# Create train/test split
set.___(___)
sample_size <- floor(___ * nrow(left_right_matrix_small))
train_ind <- ___(nrow(left_right_matrix_small), size = ___)
train <- left_right_matrix_small[___, ]
test <- left_right_matrix_small[-___, ]
# Create a random forest classifier
rfc <- randomForest(x = as.data.frame(as.matrix(___)),
                    y = ___[___],
                    ntree = 50)  # the argument is spelled ntree; nTree would be silently ignored
# Print the results
___
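
The same workflow can be sketched end to end on synthetic data. This is only an illustration under stated assumptions, not the exercise solution: toy_matrix and toy_labels are hypothetical stand-ins for the course objects, the data are randomly generated, and the randomForest package is assumed to be installed.

```r
library(randomForest)

set.seed(1111)  # fix the random number stream so the split is reproducible

# Synthetic stand-in for a TF-IDF matrix: 200 documents x 10 terms (invented data).
n_docs <- 200
toy_matrix <- matrix(runif(n_docs * 10), nrow = n_docs,
                     dimnames = list(NULL, paste0("term", 1:10)))

# Labels depend on the first term so that the model has some signal to learn.
toy_labels <- factor(ifelse(toy_matrix[, "term1"] > 0.5, "Left", "Right"))

# 75% of the rows go to training; the remaining rows are held out.
sample_size <- floor(0.75 * nrow(toy_matrix))
train_ind <- sample(nrow(toy_matrix), size = sample_size)
train <- toy_matrix[train_ind, ]
test <- toy_matrix[-train_ind, ]

# The response is subset with the same indices as the training rows.
rfc <- randomForest(x = as.data.frame(train),
                    y = toy_labels[train_ind],
                    ntree = 50)
print(rfc)

# Compare predictions on the held-out rows with their true labels.
pred <- predict(rfc, newdata = as.data.frame(test))
print(table(predicted = pred, actual = toy_labels[-train_ind]))
```

Printing the fitted object shows the out-of-bag error estimate and a confusion matrix for the training rows; the final table gives a separate check on rows the model never saw.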