
Data preparation

During the 2016 US election, Russian tweet bots were used to constantly distribute political rhetoric to both Democrats and Republicans. You have been given a dataset of such tweets called russian_tweets. You have decided to classify these tweets as either left-leaning (Democrat) or right-leaning (Republican). Before you can build a classification model, you need to clean and prepare the text for modeling.

This exercise is part of the course Introduction to Natural Language Processing in R.

Exercise instructions

  • Finalize the tokenization process by stemming the tokens.
  • Use cast_dtm() to create a document-term matrix.
  • Weight the document-term matrix using TF-IDF weighting.
  • Print the matrix.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Stem the tokens
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  ___(word = ___(word))

# Create a document-term matrix using TF-IDF weighting
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  ___(document = ___, term = ___,
           value = n, weighting = tm::___)

# Print the matrix details 
___
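For reference, one possible completion of the scaffold is sketched below. It assumes wordStem() from the SnowballC package for stemming and tidytext's cast_dtm() for building the document-term matrix; the library() calls are included only so the snippet stands on its own (in the exercise environment these packages are normally already loaded).

# One possible completion (a sketch, not the official solution):
# stem tokens with SnowballC::wordStem(), then cast the word counts
# into a TF-IDF-weighted document-term matrix with tidytext::cast_dtm().
library(dplyr)
library(tidytext)
library(SnowballC)

# Stem the tokens
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# Create a document-term matrix using TF-IDF weighting
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  cast_dtm(document = tweet_id, term = word,
           value = n, weighting = tm::weightTfIdf)

# Print the matrix details (dimensions, sparsity, weighting)
tweet_matrix

TF-IDF weighting down-weights words that appear in most tweets, so very common terms contribute less to the classifier than distinctive, tweet-specific ones.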