
Choosing the number of clusters

1. Choosing number of clusters

Great work! Now we will learn how to choose the number of clusters we want to use with k-means.

2. Methods

We learned in the previous lesson that there are several ways to do this: visually, mathematically, or by experimenting with different numbers of clusters and interpreting the results. We will focus on the elbow criterion and on experimentation. First, we will learn how to identify the recommended number of segments with the elbow criterion method. Then, we will experiment with several numbers of clusters around it.

3. Elbow criterion method

The elbow criterion method plots the sum of squared errors (SSE) for each number of segments. The SSE is the sum of squared distances from each data point to its cluster center. We then look at the chart to identify where the decrease in SSE slows down and becomes marginal. That point looks like the elbow of a bent arm, and it shows where increasing the number of clusters yields diminishing returns. It represents the optimal number of clusters from a sum-of-squared-errors perspective. However, we should choose several options around the elbow and test which makes the most sense.
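To make the SSE definition concrete, here is a minimal sketch on a tiny made-up dataset, showing that summing the squared distances from each point to its assigned cluster center reproduces the value scikit-learn stores as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny illustrative dataset (assumed, for demonstration only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# SSE: sum of squared distances from each point to its cluster center
centers = kmeans.cluster_centers_[kmeans.labels_]
sse = ((X - centers) ** 2).sum()

# scikit-learn stores this same quantity on the fitted model as inertia_
print(np.isclose(sse, kmeans.inertia_))
```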

4. Elbow criterion method

Let's take a look at the code to build the elbow criterion plot. We are using dummy data to emphasize the concept. First, we import the key libraries for plotting and k-means. Then, we create an empty dictionary called SSE, an abbreviation for sum of squared errors, and we run a for loop over different numbers of clusters, between 1 and 10. In each iteration we fit a k-means segmentation on the pre-processed data. The fitted k-means model already has the sum of squared errors calculated and stored as inertia, which we assign to the dictionary. Finally, we plot the numbers of clusters stored as the dictionary keys on the x-axis, and the sum-of-squared-error values on the y-axis.
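The steps above can be sketched as follows; the `make_blobs` dummy data and the variable names are illustrative assumptions, not the course's exact code:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Dummy data standing in for the pre-processed dataset (4 true clusters)
data, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Fit k-means for each candidate number of clusters and record the SSE
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1).fit(data)
    sse[k] = kmeans.inertia_  # sum of squared errors for k clusters

# Plot number of clusters (dict keys) against SSE (dict values)
plt.plot(list(sse.keys()), list(sse.values()), marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()
```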

5. Elbow criterion method

And here's what we get: the elbow plot! The way to read it is to look for the point with the largest angle, which is the so-called elbow.

6. Elbow criterion method

The point at 4 clusters is where we identify the largest angle, and this is the elbow we've been looking for.

7. Using elbow criterion method

It's important to understand that building the segmentation at exactly that number of clusters is not a hard rule. Generally, it should be taken as a recommendation, and we should test several segmentation approaches. Here we can see the elbow plot for our RFM data. The elbow is clearly at the 2-cluster solution, but we will still test several approaches.

8. Experimental approach - analyze segments

Finally, there is the experimental approach to choosing the number of clusters, which is best used after identifying an elbow or another computationally advised number of segments. For each of these numbers, we calculate the average RFM or other attribute values and compare the solutions to identify the one that is the most useful and provides the most insight.

9. Experimental approach - analyze segments

For example, the elbow criterion plot for the RFM data suggested 2 clusters, so we should build segmentations with at least 2 and 3 clusters and compare the outputs. As we can see, the 3-segment solution has more of a story to it: it still identifies the least attractive segment as cluster number 1, but it also breaks the higher-value segment into two, segment zero and segment two. It's up to the analyst and the business partners to review the segments and decide which solution makes more sense.
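This comparison can be sketched as follows, assuming a pre-processed RFM table with `Recency`, `Frequency`, and `MonetaryValue` columns (randomly generated here as a stand-in): fit both candidate solutions and summarize the average attribute values per segment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the pre-processed RFM data
rng = np.random.default_rng(1)
rfm = pd.DataFrame({
    'Recency': rng.integers(1, 365, 200),
    'Frequency': rng.integers(1, 50, 200),
    'MonetaryValue': rng.uniform(10, 1000, 200),
})

# Build segmentations around the elbow (here 2 and 3 clusters)
# and compare the average attribute values per cluster
for k in [2, 3]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(rfm)
    summary = rfm.assign(Cluster=labels).groupby('Cluster').mean()
    print(f'{k}-cluster solution:')
    print(summary.round(1))
```

Reviewing these per-cluster averages side by side is what lets the analyst judge which solution tells the more useful story.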

10. Let's practice finding the optimal number of clusters!

Now, you will run some calculations to build the elbow criterion plot!
