Make a dendrogram-friendly TDM
Now that you understand the steps in making a dendrogram, you can apply them to text. But first, you have to limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?
TDMs and DTMs are sparse, meaning they contain mostly zeros. Remember that 1000 tweets can become a TDM with over 3000 terms! You won't be able to easily interpret a dendrogram that is so cluttered, especially if you are working on more text.
In most professional settings, a good dendrogram is based on a TDM with 25 to 70 terms. With more than 70 terms, the visual is likely to be cluttered and hard to read. Conversely, with fewer than 25 terms, the dendrogram may not show relevant and insightful clusters.
When using removeSparseTerms(), the sparse parameter adjusts the total number of terms kept in the TDM. The closer sparse is to 1, the more terms are kept. This value represents a percentage cutoff of zeros for each term in the TDM: a term is dropped when its share of zero entries reaches or exceeds the cutoff.
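To make the cutoff concrete, here is a minimal base R sketch of the idea on a plain matrix. The toy terms, counts, and the helper keep_terms() are made up for illustration; keep_terms() is not a tm function, it only mimics the rule that a term survives when its proportion of zeros is below sparse.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# Terms and counts are invented for illustration only.
toy_tdm <- matrix(
  c(1, 2, 1, 1,   # "wine":  non-zero in 4 of 4 documents
    0, 1, 0, 1,   # "glass": non-zero in 2 of 4 documents
    0, 0, 0, 1),  # "cork":  non-zero in 1 of 4 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("wine", "glass", "cork"),
                  Docs = paste0("doc", 1:4))
)

# Proportion of zero entries for each term (its sparsity)
term_sparsity <- rowMeans(toy_tdm == 0)
print(term_sparsity)

# Hypothetical helper: keep terms whose sparsity is below the cutoff
keep_terms <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

# A lower cutoff keeps fewer terms; a cutoff closer to 1 keeps more
strict <- keep_terms(toy_tdm, sparse = 0.6)
loose  <- keep_terms(toy_tdm, sparse = 0.9)
print(rownames(strict))
print(rownames(loose))
```

With these toy numbers, "cork" is zero in three of four documents, so it only survives the looser cutoff.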
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
tweets_tdm has been created using the chardonnay tweets.
- Print the dimensions of tweets_tdm to the console.
- Create tdm1 using removeSparseTerms() with sparse = 0.95 on tweets_tdm.
- Create tdm2 using removeSparseTerms() with sparse = 0.975 on tweets_tdm.
- Print tdm1 to the console to see how many terms are left.
- Print tdm2 to the console to see how many terms are left.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print the dimensions of tweets_tdm
___
# Create tdm1
___ <- ___(___, ___)
# Create tdm2
___ <- ___(___, ___)
# Print tdm1
___
# Print tdm2
___
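For comparison, the same call pattern can be tried on a tiny hand-made corpus. This is only a sketch, assuming the tm package is installed; the four toy documents and the object names toy_tdm, toy_small, and toy_large are invented here and are not the chardonnay data or the exercise solution.

```r
library(tm)

# Invented toy documents standing in for real tweets
toy_docs <- c("chardonnay wine tonight",
              "love chardonnay",
              "wine and cheese",
              "chardonnay with dinner")

# Build a term-document matrix from the toy documents
toy_corpus <- VCorpus(VectorSource(toy_docs))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Dimensions are terms (rows) by documents (columns)
dim(toy_tdm)

# A lower sparse value drops more terms than a value closer to 1
toy_small <- removeSparseTerms(toy_tdm, sparse = 0.6)
toy_large <- removeSparseTerms(toy_tdm, sparse = 0.9)

# Printing a TDM reports its term and document counts
toy_small
toy_large
```

Comparing the two printed summaries shows how the term count grows as sparse approaches 1.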