
Data preparation

During the 2016 US election, Russian tweet bots were used to constantly distribute political rhetoric to both Democrats and Republicans. You have been given a dataset of such tweets called russian_tweets. You have decided to classify these tweets as either left-leaning (Democrat) or right-leaning (Republican). Before you can build a classification model, you need to clean and prepare the text for modeling.

This exercise is part of the course

Introduction to Natural Language Processing in R


Exercise instructions

  • Finalize the tokenization process by stemming the tokens (a short stemming sketch follows this list).
  • Use cast_dtm() to create a document-term matrix.
  • Weight the document-term matrix using tf-idf weighting.
  • Print the matrix.
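
Stemming trims inflectional endings so that related word forms count as a single token in the document-term matrix. A minimal sketch, assuming the SnowballC package (a stemmer commonly paired with tidytext) is available:

# Porter stemming collapses related forms onto one stem
library(SnowballC)

wordStem(c("running", "runs", "ran"))    # "run"   "run"   "ran"
wordStem(c("elections", "elected"))      # "elect" "elect"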

Hands-on interactive exercise

Try this exercise by completing this sample code. One possible completion is sketched after the scaffold.

# Stem the tokens
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  ___(word = ___(word))

# Create a document-term matrix using tf-idf weighting
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  ___(document = ___, term = ___,
           value = n, weighting = tm::___)

# Print the matrix details 
___
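
For reference, here is one possible completion of the scaffold above. It is a sketch rather than the only accepted answer: it assumes the dplyr, tidytext, SnowballC, and tm packages are available (they are preloaded in the course environment) and that russian_tweets has the tweet_id and content columns implied by the scaffold.

# Packages the scaffold relies on (preloaded in the course environment)
library(dplyr)
library(tidytext)
library(SnowballC)   # provides wordStem()
library(tm)          # provides weightTfIdf

# Tokenize, remove stop words, then stem each token
russian_tokens <- russian_tweets %>%
  unnest_tokens(output = "word", token = "words", input = content) %>%
  anti_join(stop_words) %>%
  mutate(word = wordStem(word))

# Count tokens per tweet and cast the counts into a
# tf-idf weighted document-term matrix
tweet_matrix <- russian_tokens %>%
  count(tweet_id, word) %>%
  cast_dtm(document = tweet_id, term = word,
           value = n, weighting = tm::weightTfIdf)

# Print the matrix details (dimensions, sparsity, weighting)
tweet_matrix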