How many clusters?

1. How many clusters?

Hi everyone! In this video, we will explore a way to decide how many clusters are present in our data.

2. Introduction to dendrograms

Up until this point, we have looked at plots of our data points to decide how many clusters to form. To decide on the number of clusters in hierarchical clustering, we can use a diagram called the dendrogram. A dendrogram is a branching diagram that shows the order in which clusters are merged, as recorded in a linkage object, as we proceed through the hierarchical clustering algorithm. Let us look at an example.

3. Create a dendrogram in SciPy

The first step in creating a dendrogram is to import the dendrogram function from scipy.cluster.hierarchy. Next, we use the linkage function to create a linkage matrix, which records the two clusters merged at each step and the distance between them. Finally, we pass the linkage object to the dendrogram function as an argument and display the plot.
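The three steps above can be sketched roughly as follows. The sample points, the random seed, and the choice of the "ward" linkage method are assumptions made for illustration; they are not taken from the video. The helper show_dendrogram is a hypothetical name, and it needs matplotlib to draw the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data (an assumption): three loose groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(10, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(10, 2)),
])

# Step 2: build the linkage matrix (one row per merge)
Z = linkage(points, method="ward")

# Step 3: compute the dendrogram layout; no_plot=True skips the drawing
tree = dendrogram(Z, no_plot=True)
print(len(tree["leaves"]), "points appear along the x axis")


def show_dendrogram(linkage_matrix):
    """Draw the dendrogram on screen (requires matplotlib)."""
    import matplotlib.pyplot as plt

    dendrogram(linkage_matrix)
    plt.show()
```

Calling show_dendrogram(Z) should open the familiar branching plot, with one leaf per data point.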

4. Dendrogram demonstration

To understand the intricacies of a dendrogram, let us look at the dendrogram that has been generated and then make corresponding clusters. Recall the hierarchical clustering algorithm, where each step was the result of merging the two closest clusters from the earlier step. The x axis represents individual points, whereas the y axis represents the distance or dissimilarity between clusters. In the dendrogram, each inverted U represents a cluster divided into its two child clusters. The inverted U at the top of the figure represents a single cluster of all the data points. The height of the U shape on the y axis represents the distance between the two child clusters. A taller U, therefore, means that the two child clusters were farther away from each other as compared to a shorter U in the diagram.
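To connect each inverted U to the underlying numbers, here is a small sketch that prints the linkage matrix for five hand-picked points. The points and the "ward" method are assumptions for illustration. In SciPy, each row of the linkage matrix describes one merge: the two child clusters, the distance between them (the height of the U), and the number of points in the merged cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five illustrative points (an assumption): two tight pairs and one outlier
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [5.0, 0.0], [5.0, 1.0],
    [10.0, 10.0],
])

Z = linkage(points, method="ward")

# One row per inverted U: children, merge height, size of merged cluster
for left, right, height, size in Z:
    print(f"merge {int(left)} + {int(right)} at height {height:.2f} "
          f"-> cluster of {int(size)} points")

# The last row is the U at the top of the figure: all points in one cluster
top_size = int(Z[-1, 3])
```

Indices below the number of points refer to individual points; larger indices refer to clusters formed by earlier rows.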

5. Dendrogram demonstration - 2

Now, if you draw a horizontal line at any part of the figure, the number of vertical lines it intersects tells you the number of clusters at that stage, and the height of the line on the y axis indicates the inter-cluster distance at that stage. At the horizontal line drawn on the figure, we see that there are three clusters. When you move the line lower, the number of clusters increases but the inter-cluster distance decreases. This information helps us in deciding the number of clusters. For instance, even though we haven't looked at the distribution of the data points, it seems that the top three clusters have the highest distances between them. At this point, I must reiterate that there is no single right metric to decide how many clusters are ideal. Here, it looks like choosing three clusters should be ideal for this exercise. However, one's argument for two or four clusters may stand as well. Let us look at the results of each of these three cases.
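Drawing a horizontal line corresponds to cutting the tree at a distance threshold. The sketch below does this with SciPy's fcluster function and criterion="distance". The data, the "ward" method, and the three cut heights are assumptions for illustration, and clusters_at is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data (an assumption): three loose groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(10, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(10, 2)),
])

Z = linkage(points, method="ward")
heights = Z[:, 2]  # merge distances, i.e. the y values in the dendrogram


def clusters_at(cut):
    """Number of clusters when the horizontal line sits at height `cut`."""
    labels = fcluster(Z, t=cut, criterion="distance")
    return len(np.unique(labels))


# Move the line from near the top of the tree to lower down
cuts = [heights[-1] * fraction for fraction in (0.9, 0.5, 0.1)]
counts = [clusters_at(cut) for cut in cuts]

for cut, count in zip(cuts, counts):
    print(f"line at height {cut:.2f}: {count} clusters")
```

Each count equals one plus the number of merges that happen above the line, which is the same as counting the vertical lines the cut crosses.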

6. Two clusters

Here is the result of performing the clustering with two clusters.

7. Three clusters

Here is the result with three clusters.

8. Four clusters

And here is how four clusters look on the data. Although the dendrogram indicated we could go ahead with three clusters, the case with four clusters makes sense too. Therefore, an additional check of visualizing the data may be performed before deciding on the number of clusters.
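One way to produce the two, three, and four cluster cases for comparison is to ask fcluster for a fixed number of clusters with criterion="maxclust". This is a sketch under assumptions: the data, the "ward" method, and the variable names are illustrative and not taken from the video.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data (an assumption): three loose groups of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(10, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(10, 2)),
])

Z = linkage(points, method="ward")

# Cluster labels for each candidate number of clusters
labels_by_k = {
    k: fcluster(Z, t=k, criterion="maxclust") for k in (2, 3, 4)
}

for k, labels in labels_by_k.items():
    # fcluster labels start at 1, so drop the empty count for label 0
    sizes = np.bincount(labels)[1:]
    print(f"{k} clusters, sizes: {sizes.tolist()}")
```

The labels for each case can then be used to color a scatter plot of the points, which is the visual check described above.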

9. Next up - try some exercises

Now, let us try some exercises on how to decide the number of clusters using the dendrogram!