Distance matrix and dendrogram

A simple way to do word cluster analysis is with a dendrogram on your term-document matrix. Once you have a TDM, you can call dist() to compute the distance between each pair of rows of the matrix.

Next, you call hclust() to perform cluster analysis on the dissimilarities in the distance matrix. Lastly, you can pass the result to plot() to visualize the word frequency distances as a dendrogram. Often in text mining, you can tease out some interesting insights or word clusters based on a dendrogram.
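As a rough illustration of the three steps just described, here is a minimal sketch on a tiny, made-up matrix. The object names (toy_tdm, toy_dist, toy_hc), the terms, and the counts are all hypothetical and only stand in for a real TDM.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
toy_tdm <- matrix(
  c(5, 4, 0,
    4, 5, 1,
    0, 1, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("coffee", "cup", "rain"), c("doc1", "doc2", "doc3"))
)

# Step 1: pairwise distances between rows (Euclidean is dist()'s default)
toy_dist <- dist(toy_tdm)

# Step 2: hierarchical clustering on those distances
# (complete linkage is hclust()'s default method)
toy_hc <- hclust(toy_dist)

# Step 3: draw the dendrogram; row names become the leaf labels
plot(toy_hc)
```

In this toy data, "coffee" and "cup" have similar counts across the documents, so they should be joined first, with "rain" attached higher up the tree.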

Consider the table of annual rainfall that you saw in the last video. Cleveland and Portland have the same amount of rainfall, so their distance is 0. You might expect the two cities to be a cluster and for New Orleans to be on its own since it gets vastly more rain.

       city rainfall
  Cleveland    39.14
   Portland    39.14
     Boston    43.77
New Orleans    62.45

The data frame rain has been preloaded in your workspace.

Create dist_rain by using the dist() function on the values in the second column of rain.
Print the dist_rain matrix to the console.
Create hc by performing a cluster analysis, using hclust() on dist_rain.
plot() the hc object with labels = rain$city to add the city names.
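One possible way to work through these steps is sketched below. Because rain is preloaded only in the exercise workspace, the sketch rebuilds it from the table above; the column names city and rainfall are taken from that table and are assumed to match the real object.

```r
# Rebuild the rain data frame from the table above
# (assumption: the preloaded object has the same columns and values).
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.14, 39.14, 43.77, 62.45),
  stringsAsFactors = FALSE
)

# Distances between the rainfall values in the second column
dist_rain <- dist(rain[, 2])

# Print the distance matrix
dist_rain

# Hierarchical cluster analysis on the distances
hc <- hclust(dist_rain)

# Dendrogram with city names as leaf labels
plot(hc, labels = rain$city)
```

The Cleveland-Portland distance is 0, so those two cities merge at height 0, and New Orleans joins the tree last.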
