Data preparation
During the 2016 US election, Russian tweet bots were used to constantly distribute political rhetoric to both Democrats and Republicans. You have been given a dataset of such tweets called russian_tweets. You have decided to classify these tweets as either left-leaning (Democrat) or right-leaning (Republican). Before you can build a classification model, you need to clean and prepare the text for modeling.
This exercise is part of the course Introduction to Natural Language Processing in R.
Exercise instructions
- Finalize the tokenization process by stemming the tokens.
- Use cast_dtm() to create a document-term matrix.
- Weight the document-term matrix using TF-IDF weighting.
- Print the matrix.
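Before completing the steps above, it helps to see what TF-IDF weighting actually computes: each term's frequency within a document is multiplied by the log of how rare the term is across all documents, so a word that appears in every document is weighted down to zero. A minimal base-R sketch of that arithmetic (the toy documents and variable names here are illustrative only, not part of the exercise dataset):

```r
# Two toy "documents" as token vectors
docs <- list(d1 = c("vote", "vote", "election"),
             d2 = c("vote", "rally"))

n_docs <- length(docs)

# Document frequency: how many documents contain "vote"
df_vote <- sum(sapply(docs, function(d) "vote" %in% d))  # 2

# Term frequency of "vote" in d1: 2 occurrences out of 3 tokens
tf_vote_d1 <- sum(docs$d1 == "vote") / length(docs$d1)

# TF-IDF: tf * log(N / df). "vote" appears in every document,
# so its inverse document frequency is log(2/2) = 0.
tfidf_vote_d1 <- tf_vote_d1 * log(n_docs / df_vote)  # 0
```

This is why TF-IDF is useful for classification: terms shared by all tweets carry no signal and are zeroed out, while distinctive terms keep high weights.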
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Stem the tokens (wordStem() comes from the SnowballC package)
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# Create a document-term matrix using TF-IDF weighting
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  cast_dtm(document = tweet_id, term = word,
           value = n, weighting = tm::weightTfIdf)

# Print the matrix details
tweet_matrix