Distance matrix and dendrogram
A simple way to do word cluster analysis is with a dendrogram on your term-document matrix (TDM). Once you have a TDM, you can call dist() to compute the distances between each pair of rows of the matrix. Next, you call hclust() to perform cluster analysis on the dissimilarities in the distance matrix. Lastly, you can visualize the word frequency distances as a dendrogram using plot(). Often in text mining, you can tease out interesting insights or word clusters from a dendrogram.
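The three-step workflow of dist(), hclust(), and plot() can be sketched on a tiny matrix. This is only an illustration: the matrix tdm, its terms, and its counts are made up for the example and stand in for a real term-document matrix, and the names term_dist and term_hc are hypothetical.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
# The terms and counts are invented for illustration only.
tdm <- matrix(
  c(5, 4, 0,
    4, 5, 1,
    0, 1, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("coffee", "cup", "rain"), c("doc1", "doc2", "doc3"))
)

# Step 1: pairwise distances between rows (Euclidean is the default method)
term_dist <- dist(tdm)

# Step 2: hierarchical clustering on the distances
# (complete linkage is the default method)
term_hc <- hclust(term_dist)

# Step 3: draw the dendrogram; leaves are labeled with the row names
plot(term_hc)
```

Because dist() works row by row, terms need to be in the rows for the dendrogram to cluster words rather than documents.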
Consider the table of annual rainfall that you saw in the last video. Cleveland and Portland have the same amount of rainfall, so their distance is 0. You might expect the two cities to be a cluster and for New Orleans to be on its own since it gets vastly more rain.
city rainfall
Cleveland 39.14
Portland 39.14
Boston 43.77
New Orleans 62.45
This exercise is part of the course
Text Mining with Bag-of-Words in R
Exercise instructions
The data frame rain has been preloaded in your workspace.
- Create dist_rain by using the dist() function on the values in the second column of rain.
- Print the dist_rain matrix to the console.
- Create hc by performing a cluster analysis, using hclust() on dist_rain.
- plot() the hc object with labels = rain$city to add the city names.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create dist_rain
___ <- ___(___)
# View the distance matrix
___
# Create hc
___ <- ___(___)
# Plot hc
___(___, ___)
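One possible way to fill in the blanks is sketched below. In the exercise, rain is already loaded, so the data.frame() call here is an assumption: it rebuilds rain from the table above, with column names city and rainfall taken from the table header, so that the snippet runs on its own.

```r
# Assumption: recreate the preloaded rain data frame from the table above
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.14, 39.14, 43.77, 62.45),
  stringsAsFactors = FALSE
)

# Create dist_rain from the second column (the rainfall values)
dist_rain <- dist(rain[, 2])

# View the distance matrix; the entry for rows 1 and 2
# (Cleveland and Portland) is 0 because their rainfall is identical
dist_rain

# Create hc by clustering on the distances
hc <- hclust(dist_rain)

# Plot hc, labeling the leaves with the city names
plot(hc, labels = rain$city)
```

With a single numeric column, each distance is simply the absolute difference in rainfall between two cities, which is why New Orleans ends up far from the other three.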