Word association
As expected, you see similar topics throughout the dendrogram. Switching back to positive comments, you decide to examine the top phrases that appeared in the word clouds. You hope to find associated terms using the findAssocs() function from tm. Now that you have learned of long hours and a lack of work-life balance, you want to check for anything surprising.
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The amzn_pros_corp corpus has been cleaned using the custom functions, as before.
- Construct a TDM called amzn_p_tdm from amzn_pros_corp and control = list(tokenize = tokenizer).
- Create amzn_p_m by converting amzn_p_tdm to a matrix.
- Create amzn_p_freq by applying rowSums() to amzn_p_m.
- Create term_frequency using sort() on amzn_p_freq along with the argument decreasing = TRUE.
- Examine the first 5 bigrams using term_frequency[1:5].
- You may be surprised to see "fast paced" as a top term because it could be a negative term related to "long hours". Look at the terms most associated with "fast paced". Use findAssocs() on amzn_p_tdm to examine "fast paced" with a 0.2 cutoff.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create amzn_p_tdm
___ <- ___(
___,
___
)
# Create amzn_p_m
___ <- ___
# Create amzn_p_freq
___ <- ___
# Create term_frequency
___ <- ___
# Print the 5 most common terms
___
# Find associations with fast-paced
___
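One way the steps above might fit together is sketched below. It is not the course's reference solution. The four review strings are a made-up stand-in for the cleaned amzn_pros_corp corpus, and the tokenizer is an assumed bigram tokenizer built from words() and ngrams() in the NLP package (attached by tm); the exercise environment supplies its own corpus and tokenizer.

```r
library(tm)  # also attaches NLP, which provides words() and ngrams()

# Hypothetical stand-in for the cleaned amzn_pros_corp corpus
reviews <- c(
  "fast paced environment with smart people",
  "good pay and fast paced work",
  "smart people and good benefits",
  "fast paced culture with good pay"
)
amzn_pros_corp <- VCorpus(VectorSource(reviews))

# Assumed bigram tokenizer; the course provides its own `tokenizer`
tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Create amzn_p_tdm: rows are bigrams, columns are documents
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)

# Create amzn_p_m
amzn_p_m <- as.matrix(amzn_p_tdm)

# Create amzn_p_freq: total count of each bigram across all documents
amzn_p_freq <- rowSums(amzn_p_m)

# Create term_frequency, most frequent bigrams first
term_frequency <- sort(amzn_p_freq, decreasing = TRUE)

# Print the 5 most common terms
term_frequency[1:5]

# Find bigrams associated with "fast paced" at a 0.2 cutoff
findAssocs(amzn_p_tdm, "fast paced", 0.2)
```

With a corpus this small the associations are not meaningful; the point is only the shape of the pipeline from TDM to matrix to sorted frequencies to findAssocs().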