1. Simple word clustering
This chapter moves beyond simple one-word frequency counts and introduces you to slightly more technical text mining.
2. Hierarchical clustering example
You will start by doing a hierarchical clustering and making a dendrogram. Consider this example of annual rainfall. The rain data frame contains four cities with corresponding annual rainfall in inches. You will do a cluster analysis for the cities based on the rainfall.
The dist function applied to the city rainfall data frame calculates the distance between every pair of cities, producing dist_rain. For example, Cleveland and Portland have the same rainfall, so their distance is 0, while Boston gets slightly more rain and New Orleans gets significantly more than the other three.
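As a rough sketch of this step, the snippet below builds a small rain data frame and passes its numeric column to dist. The rainfall figures are illustrative assumptions chosen only to match the description (two equal values, one slightly higher, one much higher); they are not taken from the slide.

```r
# Illustrative rainfall values in inches (assumed, not from the source)
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.14, 39.14, 43.77, 62.45)
)

# Use the city names as row names so they label the distance object
rownames(rain) <- rain$city

# Euclidean distance (the default method) on the numeric column only
dist_rain <- dist(rain["rainfall"])
dist_rain
```

Printing dist_rain shows the lower triangle of the distance matrix, with one entry per pair of cities.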
3. A simple dendrogram
The resulting dist_rain object is passed to hclust to create a hierarchical cluster object.
Lastly, calling base plot on the hc object gives you a simple dendrogram. Notice that Cleveland and Portland join at height 0: there is no distance between them, and they in fact have the lowest of the four rainfall totals. Boston is slightly higher but closer to Cleveland and Portland than to New Orleans, which is the highest. Note that a dendrogram reduces information. If Cleveland and Portland were separated by a small amount, even a single inch, the dendrogram would look almost the same: the two cities would still pair up first, just at a slightly greater height.
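The two steps just described might look like the following minimal sketch. The rainfall values are again illustrative assumptions, and hclust is called with its default settings (complete linkage), which the narration does not specify.

```r
# Illustrative rainfall values in inches (assumed, not from the source)
rain <- data.frame(
  city = c("Cleveland", "Portland", "Boston", "New Orleans"),
  rainfall = c(39.14, 39.14, 43.77, 62.45)
)
rownames(rain) <- rain$city

# Pairwise distances between the cities
dist_rain <- dist(rain["rainfall"])

# Hierarchical clustering with hclust defaults (complete linkage)
hc <- hclust(dist_rain)

# Base plot method for hclust objects draws the dendrogram
plot(hc)
```

Inspecting hc$height shows the height at which each merge happens, which is the information the dendrogram's vertical axis encodes.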
4. Dendrogram aesthetics
Since the basic plot is not very pleasing to the eye, we load the dendextend package. This is the code applied to a TDM instead of the rainfall data. It follows the same steps but adds branches_attr_by_labels, a function that lets you color specific branches of the dendrogram. Then you call plot the same as before. Since text dendrograms can be very busy, you may also want to add rectangles using the rect.dendrogram function, specifying the number of clusters, 2, and the border color "grey50".
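A hedged sketch of these steps follows. To keep it self-contained, it uses a small hand-made term-by-document matrix (tdm_m, with made-up terms and counts) in place of a real TDM, and it assumes the dendextend package is installed. The highlighted terms and the color "red" are arbitrary choices for illustration. The hclust result is converted with as.dendrogram first, since the dendextend functions operate on dendrogram objects.

```r
library(dendextend)

# Hypothetical term-by-document counts: rows are terms, columns are documents
tdm_m <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 6, 5,
    0, 0, 5, 6,
    3, 3, 3, 3),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("coffee", "espresso", "tea", "chai", "cup"),
    paste0("doc", 1:4)
  )
)

# Same steps as the rainfall example: distances, then clustering
hc <- hclust(dist(tdm_m))

# Convert to a dendrogram object so dendextend can modify it
hcd <- as.dendrogram(hc)

# Color the branches that lead to the chosen term labels
hcd <- branches_attr_by_labels(hcd, c("coffee", "espresso"), TF_values = "red")

# Plot as before, then outline two clusters with grey rectangles
plot(hcd, main = "Term dendrogram")
rect.dendrogram(hcd, k = 2, border = "grey50")
```

Note that rect.dendrogram draws on top of an existing plot, so it has to come after the call to plot.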
5. Let's practice!