Changing n-grams
So far, we have only made TDMs and DTMs using single words. The default is to tokenize into unigrams, but you can also use tokens containing two or more words. This can help extract useful phrases that provide additional insight or improved predictive features for a machine learning algorithm.
The function below uses the RWeka package to create trigram (three-word) tokens: min and max are both set to 3.
library(RWeka)  # provides NGramTokenizer() and Weka_control()

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
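For intuition, you can call the tokenizer directly on a short character string. This is a small illustration only; the exact output depends on the input text:
# Each token is an overlapping three-word phrase from the input
tokenizer("text mining with bag of words")
# e.g. "text mining with" "mining with bag" "with bag of" "bag of words"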
The customized tokenizer() function can then be passed to TermDocumentMatrix() or DocumentTermMatrix() via the control argument:
tdm <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = tokenizer)
)
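As a quick sanity check, you can inspect the dimensions of the resulting matrix and look at a few of the trigram tokens. This is a small sketch assuming the tm package is loaded and corpus exists in your workspace:
# Rows are trigram terms, columns are documents
dim(tdm)
# Peek at a handful of the trigram tokens
head(Terms(tdm))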
This exercise is part of the course Text Mining with Bag-of-Words in R.
Exercise instructions
A corpus has been preprocessed as before using the chardonnay tweets. The resulting object text_corp is available in your workspace.
- Create a tokenizer function like the one above that creates 2-word bigrams.
- Make unigram_dtm by calling DocumentTermMatrix() on text_corp without using the tokenizer() function.
- Make bigram_dtm using DocumentTermMatrix() on text_corp with the tokenizer() function you just made.
- Examine unigram_dtm and bigram_dtm. Which has more terms?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make tokenizer function
___ <- function(x) {
___(___, ___(___, ___))
}
# Create unigram_dtm
___ <- ___(___)
# Create bigram_dtm
___ <- ___(
___,
___
)
# Print unigram_dtm
___
# Print bigram_dtm
___
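For reference, one possible completion of the scaffold is sketched below. It assumes text_corp is the preprocessed chardonnay corpus described above and that the tm and RWeka packages are installed; it is a sketch, not the course's official solution.
library(tm)
library(RWeka)

# Make tokenizer function that returns 2-word bigrams
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create unigram_dtm with the default single-word tokenization
unigram_dtm <- DocumentTermMatrix(text_corp)

# Create bigram_dtm with the custom bigram tokenizer
bigram_dtm <- DocumentTermMatrix(
  text_corp,
  control = list(tokenize = tokenizer)
)

# Print unigram_dtm
unigram_dtm

# Print bigram_dtm
bigram_dtm
Printing each object reports the number of documents and terms; the bigram matrix usually contains more terms, since most adjacent word pairs are distinct.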