Make a dendrogram-friendly TDM

Now that you understand the steps in making a dendrogram, you can apply them to text. But first, you have to limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?

TDMs and DTMs are sparse, meaning they contain mostly zeros. Remember that 1000 tweets can become a TDM with over 3000 terms! A dendrogram with that many terms is too cluttered to interpret easily, and the problem only gets worse as you work with more text.

In most professional settings, a good dendrogram is based on a TDM with 25 to 70 terms. With more than 70 terms, the visual is likely to be cluttered and incomprehensible. Conversely, with fewer than 25 terms, your dendrogram may not show relevant and insightful clusters.

When using removeSparseTerms(), the sparse parameter adjusts the total number of terms kept in the TDM. The closer sparse is to 1, the more terms are kept. This value is a cutoff on the proportion of zeros for each term: a term is dropped when the share of documents in which it does not appear is at or above the cutoff.
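
To make the cutoff concrete, here is a minimal base-R sketch on a made-up count matrix. It does not use tm: keep_terms() is a hypothetical helper that only mimics the rule described in the tm documentation, namely that the result keeps terms whose sparsity is below sparse. The terms and counts are invented for illustration.

```r
# Toy term-by-document counts (rows = terms, columns = documents).
# All values are made up for illustration.
counts <- matrix(
  c(1, 2, 1, 1, 3,   # chardonnay: appears in all 5 documents
    1, 0, 1, 0, 1,   # wine: appears in 3 of 5
    0, 0, 1, 0, 0,   # glass: appears in 1 of 5
    0, 0, 0, 2, 0),  # oak: appears in 1 of 5
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("chardonnay", "wine", "glass", "oak"),
                  Docs = paste0("doc", 1:5))
)

# Sparsity of a term = share of documents in which it does NOT appear
term_sparsity <- rowMeans(counts == 0)

# Hypothetical helper (not part of tm): keep terms with sparsity below the cutoff
keep_terms <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

strict <- keep_terms(counts, 0.5)  # low cutoff: only frequent terms survive
loose  <- keep_terms(counts, 0.9)  # cutoff near 1: rare terms survive too

print(term_sparsity)
print(rownames(strict))
print(rownames(loose))
```

Raising the cutoff toward 1 lets rarer terms through, which is why a value such as 0.975 keeps more terms than 0.95.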

This exercise is part of the course

Text Mining with Bag-of-Words in R


Exercise instructions

tweets_tdm has been created using the chardonnay tweets.

Print the dimensions of tweets_tdm to the console.
Create tdm1 using removeSparseTerms() with sparse = 0.95 on tweets_tdm.
Create tdm2 using removeSparseTerms() with sparse = 0.975 on tweets_tdm.
Print tdm1 to the console to see how many terms are left.
Print tdm2 to the console to see how many terms are left.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Print the dimensions of tweets_tdm
___

# Create tdm1
___ <- ___(___, ___)

# Create tdm2
___ <- ___(___, ___)

# Print tdm1
___

# Print tdm2
___
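
For comparison, the same sequence of steps can be sketched on a tiny made-up corpus. This assumes the tm package is installed; the documents, the toy_ object names, and the cutoff values are illustrative assumptions chosen to suit four short documents, not the course's chardonnay tweets or the cutoffs from the instructions.

```r
library(tm)

# Made-up stand-in documents; the course's tweets are not reproduced here.
docs <- c(
  "chardonnay wine tonight",
  "chardonnay wine pairing",
  "chardonnay glass",
  "chardonnay oak barrel"
)

# Build a term-document matrix from the toy corpus
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Dimensions: number of terms by number of documents
print(dim(toy_tdm))

# A lower cutoff is stricter and keeps fewer terms
toy_tdm1 <- removeSparseTerms(toy_tdm, sparse = 0.4)

# A higher cutoff is more lenient and keeps more terms
toy_tdm2 <- removeSparseTerms(toy_tdm, sparse = 0.6)

# Printing a TDM reports its term and document counts
print(toy_tdm1)
print(toy_tdm2)
```

On a corpus this small the useful cutoffs are much lower than on real tweets. With around 1000 documents, most terms are absent from nearly every document, so values close to 1 are needed to keep a workable number of terms.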
