
Using word association

Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a term-document matrix (TDM) or document-term matrix (DTM). Reported scores range from 0 to 1. A score of 1 means that two words always appear together in documents, while a score approaching 0 means the terms seldom appear in the same document.
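To make the idea of a document-level correlation concrete, here is a minimal base R sketch. It assumes the score behaves like a Pearson correlation between two terms' per-document counts; the toy matrix and its counts are invented for illustration and are not the course data.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# All counts are invented for illustration.
toy_tdm <- matrix(
  c(1, 1, 0, 0,   # venti
    1, 1, 0, 0,   # latte
    0, 1, 1, 0,   # coffee
    0, 0, 1, 1),  # tea
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Terms = c("venti", "latte", "coffee", "tea"),
    Docs  = paste0("doc", 1:4)
  )
)

# Correlate the "venti" row with every other term across the documents
target <- toy_tdm["venti", ]
others <- toy_tdm[rownames(toy_tdm) != "venti", , drop = FALSE]
scores <- apply(others, 1, function(term_counts) cor(term_counts, target))

# Keep only scores at or above a minimum threshold, sorted high to low
min_cor <- 0.25
assoc <- sort(scores[scores >= min_cor], decreasing = TRUE)
print(round(assoc, 2))
```

In this toy data "latte" appears in exactly the same documents as "venti", so it is the only term that clears the threshold.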

Keep in mind the calculation for findAssocs() is done at the document level. So for every document that contains the word in question, the other terms in those specific documents are associated. Documents without the search term are ignored.

To use findAssocs(), pass in a TDM or DTM, the search term, and a minimum correlation. The function returns a list of all other terms that meet or exceed the minimum threshold.

findAssocs(tdm, "word", 0.25)

Minimum correlation values are often relatively low because of word diversity. Don't be surprised if 0.10 demonstrates a strong pairwise term association.
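The effect of the minimum correlation can be seen on a very small corpus. The sketch below assumes the tm package is installed; the four "tweets" and the name toy_tdm are hypothetical and exist only to show how a stricter or looser threshold changes what findAssocs() returns.

```r
library(tm)

# Four invented "tweets"; real data would be far larger and noisier
docs <- c(
  "venti latte please",
  "venti latte again",
  "black coffee",
  "green tea"
)

# Build a term-document matrix with tm's default settings
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# A strict threshold keeps only terms that nearly always co-occur with "venti"
strict <- findAssocs(toy_tdm, "venti", 0.9)

# A looser threshold also admits terms that co-occur only some of the time
loose <- findAssocs(toy_tdm, "venti", 0.25)

print(strict)
print(loose)
```

Each result is a list with one element per search term, holding a named vector of scores, so strict$venti and loose$venti can be compared directly.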

The coffee tweets have been cleaned and organized into tweets_tdm for the exercise. You will search for a term association, convert the results to a data frame with list_vect2df() from qdap, and then create a plot with the ggplot2 code in the example script.

This is a part of the course

“Text Mining with Bag-of-Words in R”


Exercise instructions

  • Create associations using findAssocs() on tweets_tdm to find terms associated with "venti", which meet a minimum threshold of 0.2.
  • View the terms associated with "venti" by printing associations to the console.
  • Create associations_df by calling list_vect2df(), passing associations, then setting col2 to "word" and col3 to "score".
  • Run the ggplot2 code to make a dot plot of the association values.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create associations
___ <- ___(___, ___, ___)

# View the venti associations
___

# Create associations_df
___ <- ___(___, ___, ___)

# Plot the associations_df values
ggplot(associations_df, aes(score, word)) + 
  geom_point(size = 3) + 
  theme_gdocs()
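One possible way to fill in the blanks, following the instructions above, is sketched here. It assumes tm, qdap, ggplot2, and ggthemes are installed (theme_gdocs() comes from ggthemes). Because the real tweets_tdm is pre-loaded only in the course environment, the script builds a small hypothetical stand-in from invented tweets so that it can run on its own.

```r
library(tm)
library(qdap)
library(ggplot2)
library(ggthemes)  # provides theme_gdocs()

# Hypothetical stand-in for the pre-loaded tweets_tdm (invented tweets)
toy_tweets <- c(
  "ordered a venti caramel latte",
  "venti caramel frappuccino today",
  "plain drip coffee at home",
  "iced tea with lemon"
)
tweets_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy_tweets)))

# Create associations
associations <- findAssocs(tweets_tdm, "venti", 0.2)

# View the venti associations
associations

# Create associations_df
associations_df <- list_vect2df(associations, col2 = "word", col3 = "score")

# Plot the associations_df values
ggplot(associations_df, aes(score, word)) +
  geom_point(size = 3) +
  theme_gdocs()
```

With the course data, only the three lines that create associations, print it, and create associations_df are needed; the stand-in corpus is there purely for illustration.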


