Changing n-grams

So far, we have built TDMs and DTMs only from single words (unigrams), which is the default. You can also build them from tokens containing two or more words. This can help extract useful phrases that lead to additional insights or provide better predictive attributes for a machine learning algorithm.

The function below uses the RWeka package to create trigram (three-word) tokens: min and max are both set to 3, so only tokens of exactly three words are produced.

library(RWeka)
# Trigram tokenizer: min = max = 3 keeps only three-word tokens
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
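To make concrete what an n-gram tokenizer returns, here is a minimal base R sketch. The helper make_ngrams() is hypothetical and written only for illustration; it splits on whitespace, so it is not a reimplementation of RWeka's NGramTokenizer(), which applies its own delimiter rules.

```r
# Hypothetical helper (not part of RWeka or tm): build n-grams by
# splitting on whitespace and joining each run of n adjacent words.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # A text with fewer than n words yields no n-grams
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

sentence <- "the quick brown fox jumps"

# Unigrams: the five individual words
make_ngrams(sentence, 1)

# Trigrams: "the quick brown", "quick brown fox", "brown fox jumps"
make_ngrams(sentence, 3)
```

A five-word sentence gives five unigrams but only three trigrams, because each trigram needs two more words to its right.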

The custom tokenizer() function can then be passed to TermDocumentMatrix() or DocumentTermMatrix() through the tokenize element of the control list:

tdm <- TermDocumentMatrix(
  corpus, 
  control = list(tokenize = tokenizer)
)
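If RWeka (which needs Java) is not available, a similar result can be sketched with the ngrams() and words() functions from the NLP package, which is attached together with tm. The two documents and the names toy_corp and bigram_tokenizer below are invented for illustration. The sketch assumes a VCorpus; in recent tm versions a SimpleCorpus reportedly ignores custom tokenizers.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical two-document corpus, invented for illustration only
docs <- c(
  "chardonnay pairs well with seafood",
  "oaked chardonnay pairs well with cheese"
)
toy_corp <- VCorpus(VectorSource(docs))

# Bigram tokenizer built on NLP::ngrams(); no RWeka or Java required
bigram_tokenizer <- function(x) {
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "),
    use.names = FALSE
  )
}

# Default tokenization (unigrams) versus the custom bigram tokenizer
unigram_tdm <- TermDocumentMatrix(toy_corp)
bigram_tdm <- TermDocumentMatrix(
  toy_corp,
  control = list(tokenize = bigram_tokenizer)
)

# List the bigram terms and compare the matrix dimensions
Terms(bigram_tdm)
dim(unigram_tdm)
dim(bigram_tdm)
```

Comparing the two dim() results shows how the choice of tokenizer changes the number of terms (rows) while the number of documents (columns) stays the same.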

This exercise is part of the course

Text Mining with Bag-of-Words in R

Exercise instructions

A corpus has been preprocessed as before using the chardonnay tweets. The resulting object text_corp is available in your workspace.

Create a tokenizer function like the one above that creates bigrams (two-word tokens).
Make unigram_dtm by calling DocumentTermMatrix() on text_corp without using the tokenizer() function.
Make bigram_dtm using DocumentTermMatrix() on text_corp with the tokenizer() function you just made.
Examine unigram_dtm and bigram_dtm. Which has more terms?

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Make tokenizer function 
___ <- function(x) {
  ___(___, ___(___, ___))
}

# Create unigram_dtm
___ <- ___(___)

# Create bigram_dtm
___ <- ___(
  ___,
  ___
)

# Print unigram_dtm
___

# Print bigram_dtm
___