Using word association
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together in documents, while a score approaching 0 means the terms seldom appear in the same document.
Keep in mind the calculation for findAssocs() is done at the document level: each term's counts are compared across documents, so a term scores highly when it tends to appear in the same documents as the search term.
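To make the document-level scoring concrete, here is a minimal base R sketch. It assumes the score behaves like a Pearson correlation between two terms' per-document counts; the count matrix and its values are hypothetical, not taken from the coffee tweets.

```r
# Hypothetical term counts: rows are terms, columns are four toy documents
counts <- rbind(
  venti = c(1, 1, 0, 0),
  latte = c(1, 1, 0, 0),
  mocha = c(0, 0, 1, 0)
)
colnames(counts) <- paste0("doc", 1:4)

# Correlate "venti" with every other term across the documents
# (assumption: this mirrors the kind of score findAssocs() reports)
others <- counts[rownames(counts) != "venti", , drop = FALSE]
scores <- apply(others, 1, function(term_counts) {
  cor(counts["venti", ], term_counts)
})

round(scores, 2)
```

In this toy matrix "latte" occurs in exactly the same documents as "venti", so it gets the highest possible score, while "mocha" never shares a document with "venti" and ends up with a negative correlation, which would fall below any minimum threshold.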
To use findAssocs(), pass in a TDM or DTM, the search term, and a minimum correlation. The function will return a list of all other terms that meet or exceed the minimum threshold.
findAssocs(tdm, "word", 0.25)
Minimum correlation values are often relatively low because of word diversity. Don't be surprised if 0.10 demonstrates a strong pairwise term association.
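As a small illustration of the call pattern, the sketch below builds a TDM from four made-up sentences and looks up the terms associated with "venti". The documents, the object names (docs, toy_corpus, toy_tdm, toy_assocs) and the 0.5 threshold are all hypothetical choices for this example; the course's own data lives in tweets_tdm.

```r
library(tm)

# Hypothetical toy documents, not the course's coffee tweets
docs <- c(
  "venti latte with extra foam",
  "venti latte to go",
  "tall mocha with whipped cream",
  "iced tea and a muffin"
)

# Build a term-document matrix with tm's default settings
toy_corpus <- VCorpus(VectorSource(docs))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Terms whose correlation with "venti" is at least 0.5
toy_assocs <- findAssocs(toy_tdm, "venti", 0.5)

# The result is a list with one named numeric vector per search term
toy_assocs$venti
```

With so few documents the scores are coarse, which is why a fairly high threshold still returns results here. On a real corpus with a large vocabulary, a much lower threshold is usually needed.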
The coffee tweets have been cleaned and organized into tweets_tdm for the exercise. You will search for a term association, manipulate the results with list_vect2df() from qdap, and then create a plot with the ggplot2 code in the example script.
This is a part of the course “Text Mining with Bag-of-Words in R”
Exercise instructions
- Create associations using findAssocs() on tweets_tdm to find terms associated with "venti", which meet a minimum threshold of 0.2.
- View the terms associated with "venti" by printing associations to the console.
- Create associations_df, by calling list_vect2df(), passing associations, then setting col2 to "word" and col3 to "score".
- Run the ggplot2 code to make a dot plot of the association values.
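To show the shape of data the plotting step expects, here is a base R sketch that flattens an association list into a data frame with word and score columns. It is only a rough stand-in for list_vect2df(), written without qdap; the list toy_assocs and its scores are hypothetical, and the real function may differ in details such as row order or column types.

```r
# Hypothetical association list, shaped like the output of findAssocs()
toy_assocs <- list(venti = c(latte = 1.00, extra = 0.58, foam = 0.58))

# Flatten the list: one row per associated term
toy_df <- data.frame(
  X1 = rep(names(toy_assocs), lengths(toy_assocs)),           # search term
  word = unlist(lapply(toy_assocs, names), use.names = FALSE), # associated term
  score = unlist(toy_assocs, use.names = FALSE),               # correlation
  stringsAsFactors = FALSE
)

toy_df
```

A data frame in this form can be passed to ggplot() with aes(score, word), which is the mapping the dot plot in the exercise uses.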
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create associations
___ <- ___(___, ___, ___)
# View the venti associations
___
# Create associations_df
___ <- ___(___, ___, ___)
# Plot the associations_df values
ggplot(associations_df, aes(score, word)) +
  geom_point(size = 3) +
  theme_gdocs()