Feature extraction & analysis: amzn_pros
amzn_pros_corp, amzn_cons_corp, goog_pros_corp, and goog_cons_corp have all been preprocessed, so now you can extract the features you want to examine. Since you are using the bag of words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to understand what phrases people positively associate with working at Amazon.
The function below uses RWeka to tokenize text into two-word terms (bigrams) and is used behind the scenes in this exercise.
library(RWeka)

# Split text into bigrams: min = 2 and max = 2 keep only two-word terms
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
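To see what a bigram tokenizer produces, here is a minimal base R sketch. The bigrams() helper is hypothetical and written only for illustration: it is not part of RWeka or tm, and it simply splits on whitespace rather than reproducing RWeka's tokenization rules.

```r
# Hypothetical helper for illustration only (not part of RWeka or tm):
# split a string on whitespace and pair each word with the word after it.
bigrams <- function(x) {
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings from leading spaces
  if (length(words) < 2) {
    return(character(0))                 # fewer than two words: no bigrams
  }
  paste(head(words, -1), tail(words, -1))
}

bigrams("good work life balance")
# "good work"    "work life"    "life balance"
```

Each term in the resulting matrix is therefore a two-word phrase, which is why the word cloud shows phrases rather than single words.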
This exercise is part of the course Text Mining with Bag-of-Words in R.
Exercise instructions
- Create amzn_p_tdm as a TermDocumentMatrix from amzn_pros_corp. Make sure to add control = list(tokenize = tokenizer) so that the terms are bigrams.
- Create amzn_p_tdm_m from amzn_p_tdm by using the as.matrix() function.
- Create amzn_p_freq to obtain the term frequencies from amzn_p_tdm_m.
- Create a wordcloud() using names(amzn_p_freq) as the words, amzn_p_freq as their frequencies, and max.words = 25 and color = "blue" for aesthetics.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create amzn_p_tdm
___ <- ___(
  ___,
  ___
)
# Create amzn_p_tdm_m
___ <- ___
# Create amzn_p_freq
___ <- ___
# Plot a word cloud using amzn_p_freq values
___(___)
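One way the four steps could fit together is sketched below. This is an illustration under stated assumptions, not the course's reference solution: the three-document corpus is placeholder text standing in for the real amzn_pros_corp, and rowSums() is assumed as the way to get term frequencies from the matrix. The corpus is built with VCorpus() because tm applies a custom tokenizer to VCorpus objects.

```r
library(tm)
library(RWeka)
library(wordcloud)

# ASSUMPTION: placeholder text standing in for the course's preprocessed
# corpus. These are not real reviews.
amzn_pros_corp <- VCorpus(VectorSource(c(
  "good pay smart people",
  "smart people good pay",
  "good pay smart people great benefits"
)))

# Bigram tokenizer, as given in the exercise
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create amzn_p_tdm: terms are bigrams because of the custom tokenizer
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)

# Create amzn_p_tdm_m: one row per bigram, one column per document
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)

# Create amzn_p_freq: sum each row to count a bigram across all documents
# (ASSUMPTION: rowSums() is one reasonable way to do this)
amzn_p_freq <- rowSums(amzn_p_tdm_m)

# Plot a word cloud using amzn_p_freq values
wordcloud(names(amzn_p_freq), amzn_p_freq, max.words = 25, color = "blue")
```

Note that wordcloud() only draws terms that occur at least min.freq times (3 by default), so with a very small corpus few phrases will appear.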