1. Basics of k-means clustering
Hi everyone! Now that you are familiar with hierarchical clustering, let us move on to k-means clustering. In the first chapter, we had a look at the algorithm behind k-means clustering - in this chapter, we will focus on the various parameters and their implications for the clustering results. Let's get started!
2. Why k-means clustering?
In the last chapter, we explored a critical issue with hierarchical clustering - runtime. This chapter discusses a new clustering technique, k-means clustering, which allows you to cluster large datasets in a fraction of the time.
3. Step 1: Generate cluster centers
To perform k-means clustering in scipy, there are two steps involved - generating the cluster centers and then assigning the cluster labels. The first step is performed by the kmeans method, which takes five arguments, described below and illustrated in a short sketch afterwards.
The first argument is the list of observations, which have been standardized through the whiten method.
The second argument, k_or_guess, is the number of clusters.
The next argument, iter, is the number of times the k-means algorithm is run; the code book with the lowest distortion across these runs is returned. Its default value is 20.
The fourth argument is the threshold. The idea behind this argument is that the algorithm is terminated if the change in distortion since the last k-means iteration is less than or equal to the threshold. Its default value is 10 raised to the power minus 5, or 0-point-00001.
The last argument, check_finite, is a boolean value indicating if a check needs to be performed on the data for the presence of infinite or NaN values. The default value is True; in that case, an error is raised if such values are present, so that bad data points cannot silently distort the results.
The kmeans function returns two values: the cluster centers and the distortion. The cluster centers are also known as the code book.
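To make this concrete, here is a minimal sketch of step 1; the random sample data and the variable names are assumptions for illustration.

import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Hypothetical raw data: 200 observations with 2 features
rng = np.random.default_rng(0)
data = whiten(rng.random((200, 2)))  # standardize each column first

# kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
cluster_centers, distortion = kmeans(data, 3)
print(cluster_centers)  # the code book: one row per cluster center
print(distortion)       # a single distortion value for the whole data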
You will notice that k-means runs significantly faster than hierarchical clustering, as the number of operations is considerably smaller in k-means clustering.
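As a rough, illustrative comparison - not a rigorous benchmark, and the random dataset here is an assumption - you could time the two approaches on the same data:

import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.cluster.vq import kmeans, whiten

data = whiten(np.random.rand(2000, 2))

# One run of each on identical data; kmeans is typically much faster
print("kmeans: ", timeit.timeit(lambda: kmeans(data, 3), number=1))
print("linkage:", timeit.timeit(lambda: linkage(data, method="ward"), number=1))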
4. How is distortion calculated?
The distortion is calculated as the mean of the Euclidean distances between the data points and their nearest cluster centers, as demonstrated in this figure. Note that this differs from the classical definition of distortion, which sums the squared distances.
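As a sanity check, you can recompute this quantity by hand. This sketch assumes the data, cluster_centers, and distortion variables from the earlier example.

import numpy as np

# Distance from every point to every center: shape (n_points, n_centers)
dists = np.linalg.norm(data[:, None, :] - cluster_centers[None, :, :], axis=2)

# Mean distance to the nearest center matches kmeans' distortion value
manual_distortion = dists.min(axis=1).mean()
print(manual_distortion, distortion)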
5. Step 2: Generate cluster labels
The next step is to use the vq method to generate cluster labels. It takes three arguments, described below and followed by a short sketch.
The first argument is the list of observations, which have been standardized through the whiten method.
The second argument is the code book, that is, the first output of the kmeans method.
The third optional argument is check_finite, a boolean value indicating if a check needs to be performed on the data for the presence of infinite or NaN values. By default, its value is set to True.
The function returns the cluster labels, also known as the "code book index", and the distortion.
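Continuing the same sketch, step 2 then looks like this:

from scipy.cluster.vq import vq

# vq(obs, code_book, check_finite=True)
cluster_labels, point_distortions = vq(data, cluster_centers)
print(cluster_labels[:10])      # a cluster label for each observation
print(point_distortions[:10])   # a distortion value for each observation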
6. A note on distortions
Let us explore distortions further. kmeans returns a single distortion value computed over the entire dataset, whereas vq returns a list of distortions, one for each data point.
The mean of the list of distortions from the vq method should approximately equal the distortion value of the kmeans method if the same list of observations is passed.
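With the objects from the sketches above, you can verify this relationship directly:

import numpy as np

# Mean of vq's per-point distortions vs. kmeans' single distortion value
print(np.isclose(point_distortions.mean(), distortion))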
7. Running k-means
Let us run k-means in Python. First, we import kmeans and vq. Then, we use kmeans to get the cluster centers and vq to get the cluster labels. Finally, we display a scatter plot with seaborn.
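A complete sketch of this workflow might look as follows; the three loose groups of sample points are an assumption, chosen so that the plot shows visible clusters.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical data: three loose groups of points in 2 dimensions
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(loc, 1.0, (50, 2)) for loc in (0, 5, 10)])

scaled = whiten(points)          # standardize the features
centers, _ = kmeans(scaled, 3)   # step 1: generate cluster centers
labels, _ = vq(scaled, centers)  # step 2: assign cluster labels

df = pd.DataFrame(scaled, columns=["x_scaled", "y_scaled"])
df["cluster_labels"] = labels
sns.scatterplot(x="x_scaled", y="y_scaled", hue="cluster_labels", data=df)
plt.show()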
8. Seaborn plot
Here is how the resultant plot looks. Notice the three distinct clusters in the figure.
9. Next up: exercises!
Now that you are familiar with k-means clustering in scipy, let's test your knowledge through some exercises.