Put it all together: a text-based dendrogram
It's time to put your skills to work to make your first text-based dendrogram. Remember, dendrograms reduce information to help you make sense of the data. This is much like how an average tells you something, but not everything, about a population. Both can be misleading. With text, there are often a lot of nonsensical clusters, but some valuable clusters may also appear.
A peculiarity of TDM and DTM objects is that you have to convert them first to matrices (with as.matrix()), before using them with the dist() function.
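As a minimal sketch of that conversion, the snippet below builds a tiny TDM and passes it through as.matrix() before calling dist(). The three toy sentences and every toy_* name are assumptions made up for illustration; they are not the course's chardonnay tweets.

```r
library(tm)

# Hypothetical toy documents (not the course data)
toy_text <- c("marvin gaye chardonnay",
              "chardonnay wine glass",
              "marvin gaye soul music")

# Build a corpus and a term-document matrix (terms in rows, documents in columns)
toy_corpus <- VCorpus(VectorSource(toy_text))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Convert the TDM to an ordinary base R matrix
toy_m <- as.matrix(toy_tdm)

# dist() computes distances between rows, so here between terms
toy_dist <- dist(toy_m)

class(toy_m)
class(toy_dist)
```

Because terms sit in the rows of a TDM, the resulting distances compare terms with each other rather than documents.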
For the chardonnay tweets, you may have been surprised to see the soul music legend Marvin Gaye appear in the word cloud. Let's see if the dendrogram picks up the same.
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
- Create tweets_tdm2 by applying removeSparseTerms() on tweets_tdm. Use sparse = 0.975.
- Create tdm_m by using as.matrix() on tweets_tdm2 to convert it to matrix form.
- Create tweets_dist containing the distances of tdm_m using the dist() function.
- Create a hierarchical cluster object called hc using hclust() on tweets_dist.
- Make a dendrogram with plot() and hc.
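The same sequence of steps can be sketched on a small stand-in corpus. Everything named toy_* below is hypothetical, and the sparse value of 0.6 is an assumption chosen only because a four-document corpus is too small for 0.975 to remove anything; it is not the value the exercise asks for.

```r
library(tm)

# Hypothetical four-document corpus standing in for the tweets
toy_text <- c("chardonnay marvin gaye",
              "chardonnay wine glass",
              "marvin gaye soul",
              "chardonnay wine dinner")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy_text)))

# Drop sparse terms; with sparse = 0.6 (assumed for this toy case)
# only terms found in at least two of the four documents remain
toy_tdm2 <- removeSparseTerms(toy_tdm, sparse = 0.6)

# Convert to a matrix, then compute distances between terms
toy_m2 <- as.matrix(toy_tdm2)
toy_dist2 <- dist(toy_m2)

# Hierarchical clustering (hclust() defaults to complete linkage)
toy_hc <- hclust(toy_dist2)

# Draw the dendrogram
plot(toy_hc)
```

Lowering sparse removes more terms, which usually makes the dendrogram easier to read at the cost of dropping rarer words.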
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create tweets_tdm2
___ <- ___(___, ___)
# Create tdm_m
___ <- ___(___)
# Create tweets_dist
___ <- ___(___)
# Create hc
___ <- ___(___)
# Plot the dendrogram
___(___)