Cage match! Amazon vs. Google pro reviews
Amazon's positive reviews appear to mention bigrams such as "good benefits", while its negative reviews focus on bigrams such as "workload" and "work-life balance" issues.
In contrast, Google's positive reviews mention "great food", "perks", "smart people", and "fun culture", among other things. Google's negative reviews discuss "politics", "getting big", "bureaucracy", and "middle management".
You decide to make a pyramid plot lining up positive reviews for Amazon and Google so you can compare the differences between any shared bigrams.
We have preloaded a data frame, all_tdm_df, consisting of terms and corresponding AmazonPro and GooglePro bigram frequencies. Using this data frame, you will identify the top 5 bigrams that are shared between the two corpora.
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Create common_words from all_tdm_df using dplyr functions. filter() on the AmazonPro column for nonzero values.
- Likewise filter the GooglePro column for nonzero values.
- Then mutate() a new column, diff, which is the abs (absolute) difference between the term frequency columns.
- Pipe common_words into slice_max to create top5_df, referencing the diff column and the top 5 values. It will print to your console for review.
- Create a pyramid.plot, passing in top5_df$AmazonPro, then top5_df$GooglePro, and finally add labels with top5_df$terms.
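The filter, mutate, and slice_max steps described above can be tried on a small stand-in before touching the real data. The sketch below is illustrative only: toy_tdm_df, toy_common, toy_top, and every count in them are made up for this example and are not taken from all_tdm_df. It assumes dplyr 1.0.0 or later, where slice_max() is available.

```r
library(dplyr)

# Hypothetical stand-in for all_tdm_df; the counts are invented for illustration
toy_tdm_df <- data.frame(
  terms     = c("good pay", "great benefits", "smart people", "free food",
                "fast paced", "great culture", "work environment"),
  AmazonPro = c(25, 24, 20, 0, 19, 3, 10),
  GooglePro = c(5, 21, 50, 30, 0, 14, 9),
  stringsAsFactors = FALSE
)

# Keep only bigrams that occur in both corpora, then compute the absolute gap
toy_common <- toy_tdm_df %>%
  filter(AmazonPro != 0, GooglePro != 0) %>%
  mutate(diff = abs(AmazonPro - GooglePro))

# Keep the rows with the largest gaps, ordered from largest to smallest
toy_top <- toy_common %>% slice_max(diff, n = 3)
print(toy_top)
```

Note that slice_max() keeps ties by default (with_ties = TRUE), so it can return more than n rows when several bigrams share the same diff value.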
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Filter to words in common and create an absolute diff column
common_words <- all_tdm_df %>%
  filter(
    ___ != 0,
    ___ != 0
  ) %>%
  ___(diff = ___(___ - ___))

# Extract top 5 common bigrams
(top5_df <- common_words %>% ___(___, n = ___))

# Create the pyramid plot
pyramid.plot(top5_df$___, top5_df$___,
             labels = top5_df$___, gap = 12,
             top.labels = c("Amzn", "Pro Words", "Goog"),
             main = "Words in Common", unit = NULL)
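To see how the pyramid.plot arguments in the template fit together, here is a minimal sketch on invented data. It assumes the plotrix package is installed (pyramid.plot comes from plotrix); toy_top and its counts are hypothetical and only stand in for top5_df.

```r
library(plotrix)

# Hypothetical counts, invented for illustration; not real review data
toy_top <- data.frame(
  terms     = c("smart people", "good pay", "great culture"),
  AmazonPro = c(20, 25, 3),
  GooglePro = c(50, 5, 14),
  stringsAsFactors = FALSE
)

# First vector is drawn as the left-hand bars, second as the right-hand bars
pyramid.plot(
  toy_top$AmazonPro, toy_top$GooglePro,
  labels = toy_top$terms,                       # bigram shown between each pair of bars
  gap = 12,                                     # room left in the middle for the labels
  top.labels = c("Amzn", "Pro Words", "Goog"),  # left, center, and right headings
  main = "Words in Common",
  unit = NULL                                   # drop the default "%" unit label
)
```

Because the bars represent raw counts rather than percentages, setting unit = NULL avoids labelling the axis with the default percent sign.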