
How many clusters?

1. How many clusters?

In the earlier chapter, we analyzed the dendrogram to determine how many clusters were present in the data. This video covers a way of determining the number of clusters in k-means clustering.

2. How to find the right k?

One critique of k-means clustering is that there is no definitive way to find out how many clusters exist in your dataset. There are certain indicative methods, and this chapter discusses one of them: constructing an elbow plot to decide the right number of clusters for your dataset.

3. Distortions revisited

Recall our discussion of distortion: it is the sum of the squared distances between each data point and its cluster center. Distortion has an inverse relationship with the number of clusters, which means that it decreases as the number of clusters increases. This trend is intuitive, because segmenting the data into smaller fragments brings each data point closer to its cluster center, which lowers the distortion. This is the underlying logic of the elbow method, which uses a line plot of distortion against the number of clusters.
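To make the definition concrete, here is a minimal sketch that computes distortion by hand with NumPy. The `distortion` helper, the toy points, and the hand-picked centers are all assumptions made for illustration; they are not part of the course code.

```python
import numpy as np

def distortion(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Pairwise squared distances, shape (n_points, n_centers)
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest center
    return sq_dists.min(axis=1).sum()

# Hypothetical toy data: two tight groups of two points each
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single center at the overall mean, (5, 1)
one_center = points.mean(axis=0, keepdims=True)
# k = 2: one center in the middle of each group
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])

d1 = distortion(points, one_center)
d2 = distortion(points, two_centers)
print(d1, d2)  # 104.0 4.0
```

Going from one center to two moves every point much closer to its center, so the distortion falls from 104 to 4.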

4. Elbow method

We first run k-means clustering on the data with a varying number of clusters, and construct an elbow plot, which has the number of clusters on the x-axis and distortion on the y-axis. The number of clusters can range from one to the number of data points. The ideal point is the one beyond which adding more clusters reduces the distortion only slightly. Let us look at the code and a sample plot to better understand how to do this.

5. Elbow method in Python

In this code, we prepare the data to construct an elbow plot. To do so, we first decide the range of the number of clusters that we would like to run the algorithm for. In this case, the number of clusters ranges from 2 to 6. Next, we run the k-means method for each number of clusters and collect the corresponding distortions in a list for later use. In the final step, we create a DataFrame with the distortion for each number of clusters and plot it using seaborn, with the number of clusters on the x-axis and distortion on the y-axis.
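The steps just described can be sketched as follows. This is a hedged reconstruction, not the exact code from the slide: it assumes SciPy's `scipy.cluster.vq.kmeans`, and the synthetic three-blob data, the `plot_elbow` helper, and the column names are invented for illustration. Note that SciPy documents the distortion it returns as the mean (non-squared) Euclidean distance to the centroids, which differs from the sum-of-squares definition given earlier but follows the same downward trend.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, whiten

# Hypothetical data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [5, 10]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])
scaled = whiten(data)  # scale each feature to unit variance

# Range of cluster counts to try: 2 through 6
num_clusters = list(range(2, 7))
distortions = []

# Run k-means once per cluster count and keep the distortion
for k in num_clusters:
    centroids, distortion = kmeans(scaled, k)
    distortions.append(distortion)

# Collect both lists in a DataFrame for plotting
elbow_plot_data = pd.DataFrame({'num_clusters': num_clusters,
                                'distortions': distortions})

def plot_elbow(df):
    """Draw the elbow plot (hypothetical helper; needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.lineplot(x='num_clusters', y='distortions', data=df)
    plt.show()
```

Calling `plot_elbow(elbow_plot_data)` draws the line plot. Because k-means starts from random centers, the exact distortion values can vary slightly between runs.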

6. Sample elbow plot

This is a sample elbow plot. Notice that distortion decreases sharply from 2 to 3 clusters, but decreases only gradually as the number of clusters increases further. The ideal number of clusters here is therefore 3.
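The same reading of the plot can be approximated numerically. The sketch below is a rough heuristic, not a method from the course: it uses made-up distortion values shaped like the sample plot and picks the number of clusters where the drop in distortion shrinks the most.

```python
import numpy as np

num_clusters = np.array([2, 3, 4, 5, 6])
# Hypothetical distortions: a sharp fall from 2 to 3, then a gradual decline
distortions = np.array([1.20, 0.45, 0.40, 0.36, 0.33])

# Decrease in distortion at each step (2->3, 3->4, 4->5, 5->6)
drops = -np.diff(distortions)

# For k = 3, 4, 5: how much smaller the next drop is than the drop into k
bend = drops[:-1] - drops[1:]

# The elbow is the interior k with the largest bend
elbow_k = int(num_clusters[1:-1][np.argmax(bend)])
print(elbow_k)  # 3
```

A heuristic like this only mirrors what the eye does on a clean plot, so it is best checked against the plot itself.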

7. Final thoughts on using the elbow method

Before completing this video, I would like to emphasize that the elbow method only gives an indication of the ideal number of clusters. Occasionally, it may be insufficient to find an optimal k. For instance, the elbow method fails when data is evenly distributed, because the distortion then falls smoothly and the plot shows no clear elbow. There are other methods to find the optimal number of clusters, such as the average silhouette and gap statistic methods. They are indicative methods too, and will not be discussed as a part of this course.

8. Next up: exercises

Now that you know how to use the elbow method to determine the number of clusters, let us move on to some exercises to find the optimal number of clusters.