Feature extraction & analysis: amzn_pros
amzn_pros_corp, amzn_cons_corp, goog_pros_corp, and goog_cons_corp have all been preprocessed, so now you can extract the features you want to examine. Since you are using the bag of words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to understand what phrases people positively associate with working at Amazon.
The function below uses RWeka to split text into two-word tokens (bigrams); it is used behind the scenes in this exercise.
library(RWeka)
# Return every two-word sequence (bigram) found in x
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
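To make the idea of a bigram concrete, here is a minimal base-R sketch of what a two-word tokenizer returns. It is an illustration only, not the RWeka implementation: the helper name `bigrams` is hypothetical, and it splits on spaces alone, whereas RWeka's tokenizer also treats punctuation as delimiters.

```r
# Hypothetical helper for illustration; not part of RWeka or tm.
bigrams <- function(x) {
  # Split on runs of spaces and drop any empty strings
  words <- strsplit(x, " +")[[1]]
  words <- words[words != ""]
  # A bigram needs at least two words
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

bigrams("good pay great benefits")
# "good pay" "pay great" "great benefits"
```

Each overlapping pair of adjacent words becomes one term, so a review of n words yields n - 1 bigrams.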
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Create amzn_p_tdm as a TermDocumentMatrix from amzn_pros_corp. Make sure to add control = list(tokenize = tokenizer) so that the terms are bigrams.
- Create amzn_p_tdm_m from amzn_p_tdm by using the as.matrix() function.
- Create amzn_p_freq to obtain the term frequencies from amzn_p_tdm_m.
- Create a wordcloud() using names(amzn_p_freq) as the words, amzn_p_freq as their frequencies, and max.words = 25 and color = "blue" for aesthetics.
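The frequency step can be sketched on a toy matrix. The example below assumes that term frequencies are obtained by summing each row of the term-document matrix with rowSums(); the instructions do not name the function, so treat that as an assumption. The matrix `toy_tdm_m`, its terms, and its counts are all made up for illustration.

```r
# Toy term-document matrix: rows are bigrams, columns are documents.
# Terms and counts are invented for this sketch.
toy_tdm_m <- matrix(
  c(1, 0, 2,
    0, 1, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Terms = c("good pay", "smart people", "great benefits"),
    Docs = c("1", "2", "3")
  )
)

# Summing each row gives a term's total count across all documents
# (assumed approach); sorting puts the most frequent terms first.
toy_freq <- sort(rowSums(toy_tdm_m), decreasing = TRUE)
toy_freq
#       good pay great benefits   smart people
#              3              2              1
```

The result is a named numeric vector, so names(toy_freq) supplies the words and toy_freq itself supplies the frequencies that a word cloud function expects.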
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create amzn_p_tdm
___ <- ___(
___,
___
)
# Create amzn_p_tdm_m
___ <- ___
# Create amzn_p_freq
___ <- ___
# Plot a word cloud using amzn_p_freq values
___(___)