1. Selecting number of clusters
Now that you have created your first hierarchical cluster model in R, let's take the next step in how to interpret and use a hierarchical cluster model.
2. Interpreting results
If you look at the summary of the hierarchical clustering model, the output is technical and opaque. In all honesty, it's not a very useful summary.
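To see what this looks like, here is a minimal sketch. The data values and the name hclust.out are assumptions made for illustration, not the course's actual example. Calling summary() on the model typically just lists the components of the object, which says little about the clusters themselves.

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Fit the hierarchical clustering model on Euclidean distances
hclust.out <- hclust(dist(x))

# Prints a table of the object's components rather than an interpretable result
summary(hclust.out)
```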
3. Dendrogram
To remedy this issue, first let's build a little more intuition about the hierarchical clustering algorithm.
I will be using the same example of 5 observations each with the 2 features you've seen before. These are color-coded and presented in the 2-dimensional plane on the left.
The same five points are also presented on the right. On this side we will be building a tree, a dendrogram, which represents the clustering process.
4. Dendrogram
As described before, every observation starts out as its own cluster. Then the closest two clusters are joined together into a single cluster. This is equivalent to two points being joined on the tree representation of the clusters.
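As a rough sketch of that first step, the closest pair can be found by hand from the distance matrix. The five points below are made-up values, not the ones on the slide.

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Pairwise Euclidean distances between all observations, as a full matrix
d <- as.matrix(dist(x))

# Ignore the zero distance from each observation to itself
diag(d) <- Inf

# Row and column index of the smallest distance: the first pair to be joined
closest <- which(d == min(d), arr.ind = TRUE)[1, ]
closest
```

With these values, observations 1 and 2 are the closest pair, so they would form the first joined cluster.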
5. Dendrogram
This process continues, finding the closest two clusters and joining them into a new cluster. The distance between the clusters is represented as the height of the horizontal line on the dendrogram.
6. Dendrogram
The next iteration then joins the next two closest clusters.
7. Dendrogram
This algorithm continues until only one cluster remains. This also completes the tree representation, the dendrogram, of the results of the hierarchical clustering algorithm.
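The sequence of joins can also be read off the fitted model. The sketch below again uses made-up data and the assumed name hclust.out, and relies on hclust()'s default linkage (complete), which this transcript does not specify.

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Fit the model; hclust() uses complete linkage unless told otherwise
hclust.out <- hclust(dist(x))

# One row per join: a negative entry is a single observation,
# a positive entry refers to the cluster formed at that earlier step
hclust.out$merge

# Height (distance) at which each join happens, in the order the joins occur
hclust.out$height
```

Five observations give four joins, and each height corresponds to one horizontal line on the dendrogram.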
8. Dendrogram plotting in R
To create the dendrogram in R, the output of the hclust() function, the model, is passed into R's plot() function.
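A minimal sketch of that call, assuming made-up data and a model stored under the hypothetical name hclust.out:

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Fit the hierarchical clustering model on Euclidean distances
hclust.out <- hclust(dist(x))

# Passing the model to plot() draws the dendrogram
plot(hclust.out)
```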
The next typical step in hierarchical clustering is to determine the number of clusters you want in the model. This is one of the key model selection steps for this algorithm. One way to think about this is as drawing a cut line at a particular 'height', that is, at a particular distance between the clusters.
Choosing the number of clusters based on distance between the clusters is equivalent to drawing a line on the dendrogram at a height equal to the desired distance between clusters. This is done using the abline() function in R, using the 'h' parameter to specify the height at which to draw the line and, optionally, the 'col' parameter to set the color of the line.
Here I show the results of abline() with a horizontal red line. Specifying the height of the line is equivalent to specifying that you want clusters that are no further apart than that height. The distance between clusters can be any metric, but throughout this course we will be using Euclidean distance.
In this example, the result is two clusters with the blue, purple, and orange observations assigned to cluster 1 and the red and green observations assigned to cluster 2.
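The cut line can be sketched as follows. The data, the name hclust.out, and the cut height of 5 are all assumptions chosen for illustration, not the values behind the slide.

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Fit the model on Euclidean distances and draw the dendrogram
hclust.out <- hclust(dist(x, method = "euclidean"))
plot(hclust.out)

# Draw a horizontal red line at height 5 (an arbitrary height for this sketch)
abline(h = 5, col = "red")
```

With these made-up points, only the final join sits above the line, so cutting at this height leaves two clusters.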
9. Tree "cutting" in R
Finally, to make cluster assignments for each observation, you can use the cutree() function in R. The cutree() function takes as its parameters the hierarchical cluster model and either the height at which to cut the dendrogram, the 'h' parameter, or the number of clusters you want to keep, the 'k' parameter. The result is a vector with a numeric cluster assignment for each observation.
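Both ways of cutting can be sketched like this, again with made-up data, the assumed name hclust.out, and an arbitrary height of 5:

```r
# Hypothetical data: 5 observations with 2 features each (illustrative values only)
x <- rbind(c(0, 0), c(1, 0), c(5, 0), c(5, 2), c(0, 4))

# Fit the hierarchical clustering model on Euclidean distances
hclust.out <- hclust(dist(x))

# Cut by height: joins above h = 5 are undone
by_height <- cutree(hclust.out, h = 5)

# Cut by count: ask directly for 2 clusters
by_count <- cutree(hclust.out, k = 2)

by_height
by_count
```

For this data the two calls give the same assignments, because a height of 5 falls between the last two joins. In general, 'h' and 'k' are two different ways of picking the same kind of cut.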
10. Let's practice!
Ok, now it's your turn to practice what you've learned.