
Limitations of k-means clustering

1. Limitations of k-means clustering

You have learned about k-means clustering in SciPy in earlier exercises. We will now focus on the limitations of this clustering method and why you should proceed with caution when using it.

2. Limitations of k-means clustering

Earlier, we saw that k-means clustering overcomes the biggest drawback of hierarchical clustering: its runtime. However, it comes with its own set of limitations, which you should consider while using it. The first issue is the procedure for finding the right number of clusters, k. As discussed earlier, the elbow method is one way to determine the right k, but it may not always work. The next limitation is the impact of seeds on clustering, which we will explore shortly. The final limitation we will explore is the tendency to form equal-sized clusters.

3. Impact of seeds

Let us look at the impact of seeds on the resulting clusters. Because the initial cluster centers are chosen at random, this initialization can affect the final clusters. Therefore, to get consistent results when running k-means clustering on the same dataset multiple times, it is a good idea to set the parameters for random number generation. The seed is initialized through the seed method of NumPy's random module; you can pass a single integer or a 1-D array as the argument. Let us see the results of k-means clustering when we pass two different seeds before running the algorithm. For the purposes of testing, we take a list of 200 randomly generated points and use five clusters. In the two cases, the cluster sizes turn out to be different. Let us see the plots.
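A minimal sketch of this experiment, assuming 200 uniformly distributed stand-in points (not the course dataset), k = 5, and SciPy's kmeans, vq and whiten functions:

```python
# Sketch of the seed experiment on hypothetical stand-in data.
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

np.random.seed(42)                    # fix the stand-in data itself
points = np.random.rand(200, 2)       # 200 fairly uniform 2-D points
scaled = whiten(points)               # scale each column to unit variance

for s in [0, 1000]:                   # two different seeds
    np.random.seed(s)                 # seed drives the random centroid initialization
    centroids, _ = kmeans(scaled, 5)  # run k-means with k = 5
    labels, _ = vq(scaled, centroids) # assign each point to its nearest centroid
    print(f"seed={s}, cluster sizes: {np.bincount(labels)}")
```

With fairly uniform data like this, the two seeds typically produce somewhat different cluster sizes, which is the effect discussed above.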

4. Impact of seeds: plots

Here are the plots comparing the resulting clusters. You will notice that many points along the cluster boundaries have switched clusters. Interestingly, the effect of seeds is only visible when the data to be clustered is fairly uniform. If the data already contains distinct clusters, changing the seed will not change the resulting clusters.

5. Uniform clusters in k-means

To illustrate the bias of k-means clustering towards uniform clusters as it minimizes variance, let us perform clustering on this set of 280 points, divided into non-uniform groups of 200, 70 and 10. Graphically, they look distinctly separated into three clusters, so any clustering algorithm should pick up these three groups. Let us test that theory with k-means clustering first.
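As a sketch of this setup, assuming synthetic stand-in blobs of 200, 70 and 10 points (not the course data), we can run k-means with k = 3 and inspect the cluster sizes:

```python
# Sketch of the non-uniform experiment on hypothetical stand-in data.
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

np.random.seed(0)
points = np.vstack([
    np.random.normal([0, 0], 1.0, (200, 2)),   # large group
    np.random.normal([8, 8], 0.5, (70, 2)),    # medium group
    np.random.normal([0, 8], 0.5, (10, 2)),    # small group
])

scaled = whiten(points)
centroids, _ = kmeans(scaled, 3)               # ask for three clusters
labels, _ = vq(scaled, centroids)
print("k-means cluster sizes:", np.bincount(labels))   # compare with 200 / 70 / 10
```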

6. Uniform clusters in k-means: a comparison

If you look at the results of k-means clustering on this dataset, you get non-intuitive clusters even after varying the seeds. This is because the very idea of k-means clustering is to minimize distortions, which results in clusters that cover similar areas but do not necessarily contain a similar number of data points. However, when you look at the results of hierarchical clustering on the same dataset, using the complete method to decide cluster proximity, you will notice that the clusters formed are intuitive and consistent with our assumption in the earlier slide.
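For comparison, a hedged sketch of hierarchical clustering with the complete method, again on synthetic stand-in blobs of 200, 70 and 10 points rather than the course data:

```python
# Sketch of the hierarchical-clustering comparison on hypothetical stand-in data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import whiten

np.random.seed(0)
points = np.vstack([
    np.random.normal([0, 0], 1.0, (200, 2)),   # large group
    np.random.normal([8, 8], 0.5, (70, 2)),    # medium group
    np.random.normal([0, 8], 0.5, (10, 2)),    # small group
])
scaled = whiten(points)

# 'complete' linkage measures cluster distance by the farthest pair of points
Z = linkage(scaled, method='complete', metric='euclidean')
labels = fcluster(Z, 3, criterion='maxclust')   # cut the tree into three clusters
print("hierarchical cluster sizes:", np.bincount(labels)[1:])  # fcluster labels start at 1
```

On clearly separated groups like these, the hierarchical result tends to track the original group sizes more closely than k-means does.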

7. Final thoughts

Finally, we realize that each technique has its pros and cons, and you should know the underlying assumptions of each technique before applying it. Ideally, you should spend some time considering your data size, its patterns, and the resources and time available to you before settling on an algorithm. Remember, clustering is still the exploratory phase of your analysis - some trial and error is perfectly fine at this stage.

8. Next up: exercises

Let us work on some exercises now.