Classification modeling example

You have previously prepared a set of Russian tweets for classification. Of the 20,000 tweets, you have filtered to tweets with an account_type of Left or Right, and selected the first 2000 tweets of each. You have already tokenized the tweets into words, removed stop words, and performed stemming. Furthermore, you converted word counts into a document-term matrix with TFIDF values for weights and saved this matrix as: left_right_matrix_small.

You will use this matrix to predict whether a tweet was generated from a left-leaning tweet bot, or a right-leaning tweet bot. The labels can be found in the vector, left_right_labels.

This exercise is part of the course

Introduction to Natural Language Processing in R

View Course

Exercise instructions

Set the random seed to 1111 for reproducibility.
Create training and test datasets. Use a 75% sample for the training data.
Run a random forest model on the training data, use left_right_labels for the response vector y.
Print the random forest results.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

library(randomForest)

# Create train/test split
set.___(___)
sample_size <- floor(___ * nrow(left_right_matrix_small))
train_ind <- ___(nrow(left_right_matrix_small), size = ___)
train <- left_right_matrix_small[___, ]
test <- left_right_matrix_small[-___, ]

# Create a random forest classifier
rfc <- randomForest(x = as.data.frame(as.matrix(___)), 
                    y = ___[___],
                    nTree = 50)
# Print the results
___

Edit and Run Code

Introduction to Natural Language Processing in R

IntermediateSkill Level

4.8+

40 reviews

In chapter 4 we cover two staples of natural language processing, sentiment analysis, and word embeddings. These are two analysis techniques that are a must for anyone learning the fundamentals of text analysis. Furthermore, you will briefly learn about BERT, part-of-speech tagging, and named entity recognition. Almost 15 different analysis techniques were covered in this course, so chapter 4 ends by recapping all of the great techniques you will learn about in this course.

Exercise 1: Sentiment analysis Exercise 2: tidytext lexicons Exercise 3: Sentiment scores Exercise 4: Sentiment and emotion Exercise 5: Word embeddings Exercise 6: h2o practice Exercise 7: word2vec Exercise 8: Additional NLP analysis Exercise 9: Reviewing methods #1 Exercise 10: Review methods #2 Exercise 11: Conclusion