Feature extraction & analysis: amzn_pros

amzn_pros_corp, amzn_cons_corp, goog_pros_corp, and goog_cons_corp have all been preprocessed, so now you can extract the features you want to examine. Since you are using the bag of words approach, you decide to create a bigram TermDocumentMatrix for Amazon's positive reviews corpus, amzn_pros_corp. From this, you can quickly create a wordcloud() to understand what phrases people positively associate with working at Amazon.

The function below uses RWeka to tokenize two terms and is used behind the scenes in this exercise.

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

Create amzn_p_tdm as a TermDocumentMatrix from amzn_pros_corp. Make sure to add control = list(tokenize = tokenizer) so that the terms are bigrams.
Create amzn_p_tdm_m from amzn_p_tdm by using the as.matrix() function.
Create amzn_p_freq to obtain the term frequencies from amzn_p_tdm_m.
Create a wordcloud() using names(amzn_p_freq) as the words, amzn_p_freq as their frequencies, and max.words = 25 and color = "blue" for aesthetics.

Jumping into Text Mining with Bag-of-Words

Word Clouds and More Interesting Visuals

Adding to Your TM Skills

Battle of the Tech Giants for Talent

Ubung

Feature extraction & analysis: amzn_pros

Anweisungen