
Data preparation

During the 2016 US election, Russian Twitter bots were used to continuously spread political rhetoric to both Democrats and Republicans. You have been given a dataset of such tweets called russian_tweets, and you have decided to classify each tweet as either left-leaning (Democrat) or right-leaning (Republican). Before you can build a classification model, you need to clean and prepare the text for modeling.

This exercise is part of the course

Introduction to Natural Language Processing in R


Exercise instructions

  • Finalize the tokenization process by stemming the tokens.
  • Use cast_dtm() to create a document-term matrix.
  • Weight the document-term matrix using TF-IDF weighting.
  • Print the matrix.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Stem the tokens
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  ___(word = ___(word))

# Create a document term matrix using TFIDF weighting
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  ___(document = ___, term = ___,
           value = n, weighting = tm::___)

# Print the matrix details 
___
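
To see the same pipeline end to end on data you can inspect by eye, here is a minimal sketch on a made-up three-tweet data frame. The name toy_tweets and its contents are hypothetical and stand in for russian_tweets. The sketch assumes that stemming is done with SnowballC::wordStem() and that the TF-IDF weighting function is tm::weightTfIdf; the exercise itself does not name either, so treat both as one reasonable choice rather than the required answer.

```r
library(dplyr)
library(tidytext)
library(SnowballC)  # assumed source of the stemmer (wordStem)

# Hypothetical toy data standing in for russian_tweets
toy_tweets <- tibble(
  tweet_id = c(1, 2, 3),
  content = c("Voters are voting early in the election",
              "The senator voted against the bill",
              "Elections and senators dominate the news")
)

# Tokenize, drop stop words, then reduce each token to its stem
toy_tokens <- toy_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(word = wordStem(word))

# Count stems per tweet and cast to a TF-IDF weighted document-term matrix
toy_matrix <- toy_tokens %>%
  count(tweet_id, word) %>%
  cast_dtm(document = tweet_id, term = word,
           value = n, weighting = tm::weightTfIdf)

# Print the matrix summary (documents, terms, sparsity, weighting)
toy_matrix
```

Passing by = "word" to anti_join() is optional, but it silences the message about which column is used for the join. The tm package must be installed for tm::weightTfIdf to resolve, even though it is not attached with library().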