
Feature extraction & analysis: amzn_pros

amzn_pros_corp, amzn_cons_corp, goog_pros_corp, and goog_cons_corp have all been preprocessed, so you can now extract the features you want to examine. Since you are using the bag-of-words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to see which phrases people positively associate with working at Amazon.

The function below, which is used behind the scenes in this exercise, calls RWeka's NGramTokenizer() to split text into bigrams (two-word phrases).

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
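
To get a feel for what a bigram tokenizer returns, here is a minimal base-R sketch that does not depend on RWeka. The helper name bigrams_base() and the sample sentence are made up for illustration. It only splits on whitespace, so it approximates rather than replicates NGramTokenizer().

```r
# Hypothetical helper (not part of RWeka or tm): builds bigrams with base R.
bigrams_base <- function(x) {
  # Split the string on whitespace and drop empty pieces
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Fewer than two words means there are no bigrams
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(words[-length(words)], words[-1])
}

# Made-up example sentence
bigrams_base("good pay great benefits")
```

Each word is paired with its successor, so a sentence of four words yields three overlapping bigrams.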

This exercise is part of the course

Text Mining with Bag-of-Words in R


Exercise instructions

  • Create amzn_p_tdm as a TermDocumentMatrix from amzn_pros_corp. Make sure to add control = list(tokenize = tokenizer) so that the terms are bigrams.
  • Create amzn_p_tdm_m from amzn_p_tdm by using the as.matrix() function.
  • Create amzn_p_freq by applying rowSums() to amzn_p_tdm_m to obtain the term frequencies.
  • Create a wordcloud() using names(amzn_p_freq) as the words, amzn_p_freq as their frequencies, and max.words = 25 and color = "blue" for aesthetics.
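
The same sequence of steps (bigram counts per document, a matrix, row sums, then a word cloud) can be sketched on toy data with base R alone. Everything here is an assumption for illustration: the three reviews are invented, toy_bigrams() is a hypothetical stand-in for the RWeka tokenizer, and the matrix is built by hand rather than with TermDocumentMatrix(). The word cloud is drawn only if the wordcloud package happens to be installed.

```r
# Made-up reviews standing in for a preprocessed corpus
reviews <- c("good pay great benefits",
             "great benefits smart people",
             "smart people good pay")

# Hypothetical base-R bigram helper (a stand-in for the RWeka tokenizer)
toy_bigrams <- function(x) {
  words <- strsplit(x, "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])
}

# Bigrams per document, and the sorted vocabulary across all documents
tokens <- lapply(reviews, toy_bigrams)
terms <- sort(unique(unlist(tokens)))

# Term-by-document count matrix: one row per bigram, one column per review
toy_tdm_m <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(toy_tdm_m) <- terms

# Term frequencies are the row sums, sorted from most to least frequent
toy_freq <- sort(rowSums(toy_tdm_m), decreasing = TRUE)
print(toy_freq)

# Optional plot; min.freq = 1 because the toy counts are small
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(toy_freq), toy_freq,
                       min.freq = 1, max.words = 25, colors = "blue")
}
```

Because terms are rows and documents are columns, rowSums() gives the total count of each bigram across all reviews, which is the frequency vector a word cloud needs.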

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create amzn_p_tdm
___ <- ___(
  ___,
  ___
)

# Create amzn_p_tdm_m
___ <- ___

# Create amzn_p_freq
___ <- ___

# Plot a word cloud using amzn_p_freq values
___(___)