
Distance matrix and dendrogram

A simple way to cluster words is to draw a dendrogram from your term-document matrix (TDM). Once you have a TDM, you can call dist() to compute the distance between each pair of rows in the matrix.

Next, you call hclust() on the resulting distance matrix to perform hierarchical cluster analysis. Lastly, you can pass the result to plot() to draw a dendrogram of the distances between word frequencies. In text mining, a dendrogram often helps you tease out interesting word clusters.
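The three steps above can be sketched on a toy matrix. This is a minimal illustration, not course material: the terms, the counts, and the names tdm, term_dist, hc_terms and groups are all invented here, and a plain base R matrix stands in for a real TDM.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The terms and counts are invented for illustration only.
tdm <- matrix(
  c(5, 4, 0,
    4, 5, 1,
    0, 1, 6,
    1, 0, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("coffee", "cup", "rain", "storm"),
    c("doc1", "doc2", "doc3")
  )
)

# dist() computes the Euclidean distance between every pair of rows by default
term_dist <- dist(tdm)

# hclust() builds the hierarchy; its default linkage method is "complete"
hc_terms <- hclust(term_dist)

# Draw the dendrogram; the row names of tdm become the leaf labels
plot(hc_terms, main = "Toy term dendrogram")

# Cut the tree into two groups to inspect cluster membership
groups <- cutree(hc_terms, k = 2)
groups
```

Because "coffee" and "cup" have similar counts across the documents, as do "rain" and "storm", each pair should end up in its own group. cutree() is used here only to read off the membership that the dendrogram shows visually.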

Consider the table of annual rainfall that you saw in the last video. Cleveland and Portland have the same amount of rainfall, so their distance is 0. You might expect the two cities to form a cluster and New Orleans to sit on its own, since it gets vastly more rain.

       city rainfall
  Cleveland    39.14
   Portland    39.14
     Boston    43.77
New Orleans    62.45
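To check the distances implied by this table, here is a small sketch that rebuilds the data by hand. It assumes the preloaded rain data frame has the columns city and rainfall shown above; the names rainfall and dist_mat are introduced only for this example.

```r
# Rebuild the table by hand; in the exercise, `rain` is already preloaded
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.14, 39.14, 43.77, 62.45)
)

# Name the values so the distance matrix is labelled by city
rainfall <- setNames(rain$rainfall, rain$city)

# Convert the "dist" object to a full matrix for easier reading
dist_mat <- as.matrix(dist(rainfall))
round(dist_mat, 2)
```

With a single numeric column, each distance is simply the absolute difference in rainfall, so the Cleveland and Portland entry is 0 and the New Orleans row holds the largest values.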

This exercise is part of the course

Text Mining with Bag-of-Words in R


Exercise instructions

The data frame rain has been preloaded in your workspace.

  • Create dist_rain by using the dist() function on the values in the second column of rain.
  • Print the dist_rain matrix to the console.
  • Create hc by performing a cluster analysis, using hclust() on dist_rain.
  • plot() the hc object with labels = rain$city to add the city names.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create dist_rain
___ <- ___(___)

# View the distance matrix
___

# Create hc
___ <- ___(___)

# Plot hc
___(___, ___)