Practical implementation of k-means clustering

1. Practical implementation of k-means clustering

Great job! Now we are entering the final and most important part of this course: we will implement a segmentation project with k-means clustering, using the data we prepared in the previous lessons.

2. Key steps

A segmentation project has four key steps. First, we pre-process the data, which we covered in the previous lessons. Then we choose the number of clusters, because k-means requires that number as an input. Next, we run k-means clustering and receive a cluster label for each customer. Finally, we analyze the average recency, frequency and monetary values of each cluster and compare them.

3. Data pre-processing

As we covered data pre-processing in the previous lesson, we will just recap the key points. We have the raw and the pre-processed datasets loaded as datamart_rfm and datamart_normalized respectively. The code that created the normalized version first unskewed the data with a log transformation. It then normalized the result with StandardScaler(), which centers each variable by subtracting its mean and scales it by dividing by its standard deviation.
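The recap above can be sketched in code. This is only a minimal illustration: the synthetic values and the column names Recency, Frequency and MonetaryValue are assumptions standing in for the course's datamart_rfm, which is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for datamart_rfm (hypothetical values and column names)
rng = np.random.default_rng(0)
datamart_rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=200),
    "Frequency": rng.integers(1, 50, size=200),
    "MonetaryValue": rng.gamma(2.0, 150.0, size=200) + 1.0,
})

# Step 1: unskew with a log transformation (values must be strictly positive)
datamart_log = np.log(datamart_rfm)

# Step 2: center each column on its mean and scale by its standard deviation
scaler = StandardScaler()
datamart_normalized = pd.DataFrame(
    scaler.fit_transform(datamart_log),
    index=datamart_rfm.index,
    columns=datamart_rfm.columns,
)

# Each column should now have mean close to 0 and standard deviation close to 1
print(datamart_normalized.describe().round(2))
```

Wrapping the scaled array back into a DataFrame keeps the customer index and column names, which makes it easier to line the cluster labels up with the raw data later.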

4. Methods to define the number of clusters

When running k-means, you have to pass the number of clusters. This is not an easy decision when there is no data to support the choice. Fortunately, there are several methods for getting a good estimate. We will use visual methods such as the elbow criterion, since they are easy to interpret and give a good estimate. There are also mathematical methods such as the silhouette coefficient, which is useful when looking for a model with better-defined clusters. It is not without its caveats, but it is still commonly used. We won't be using the silhouette coefficient in our segmentation project. Finally, it's important to understand that, because we are segmenting customers, these methods should be treated as advisory. Each solution's interpretation should first of all make sense for the business, and it should be actionable.
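One common way to apply the elbow criterion is to fit k-means for a range of cluster counts and compare the sum of squared errors, which scikit-learn exposes as the inertia_ attribute. The sketch below assumes that approach and uses synthetic, three-group data as a stand-in for the pre-processed dataset; the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the pre-processed data: three well-separated groups
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 3))
    for center in (-3.0, 0.0, 3.0)
])

# Fit k-means for each candidate k and record the sum of squared errors
sse = {}
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1)
    kmeans.fit(data)
    sse[k] = kmeans.inertia_

# The "elbow" is the k after which the decrease in SSE flattens out
for k, value in sse.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the usual elbow chart. On this synthetic data the drop should level off after three clusters; on real customer data the elbow is often less sharp, which is one reason to treat it as advisory.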

5. Running k-means

Running k-means is a pretty straightforward process. First, we import KMeans from the scikit-learn library. Then we initialize the model by passing the number of clusters and any integer as the random state. Next, we compute the k-means clustering on our pre-processed data with the fit() method. Finally, we extract the computed cluster labels. That's it!
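These four steps might look roughly like the following. The data here is a synthetic two-group stand-in for datamart_normalized, and the choice of two clusters and random_state=1 is only an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for datamart_normalized: two well-separated groups
rng = np.random.default_rng(2)
datamart_normalized = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=2.0, scale=0.5, size=(50, 3)),
])

# Initialize the model with the number of clusters and a fixed random state
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)

# Compute the clustering on the pre-processed data
kmeans.fit(datamart_normalized)

# Extract one cluster label per row (per customer)
cluster_labels = kmeans.labels_
print(cluster_labels[:10])
```

The label numbers themselves are arbitrary: which group ends up as 0 and which as 1 depends on the initialization, so only the grouping carries meaning.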

6. Analyzing average RFM values of each cluster

The next step is to analyze how these clusters differ from each other, and we will do that with the raw data. First, we create a new DataFrame that has the raw RFM columns plus a cluster column, which we add with the assign() method. Then we take the new DataFrame, calculate the average RFM values for each cluster, and count the number of observations in each cluster.
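A small sketch of this summary step follows. The four-row table, the column names and the name datamart_rfm_k2 are made up for illustration; in the course, the labels would come from the fitted k-means model rather than a hand-written array.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical raw RFM table and matching cluster labels
datamart_rfm = pd.DataFrame({
    "Recency": [10, 200, 16, 180],
    "Frequency": [20, 2, 26, 4],
    "MonetaryValue": [500.0, 40.0, 650.0, 60.0],
})
cluster_labels = np.array([1, 0, 1, 0])

# Add the labels as a new column; assign() returns a new DataFrame
datamart_rfm_k2 = datamart_rfm.assign(Cluster=cluster_labels)

# Average RFM values per cluster, plus the number of customers in each
summary = datamart_rfm_k2.groupby("Cluster").agg({
    "Recency": "mean",
    "Frequency": "mean",
    "MonetaryValue": ["mean", "count"],
}).round(0)
print(summary)
```

Because assign() leaves datamart_rfm unchanged, the same raw table can be reused to build summaries for other numbers of clusters.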

7. Analyzing average RFM values of each cluster

The result is a simple table that shows how these two segments differ from each other. It's clear that the customers in segment 0 purchased less recently and much less frequently, and their monetary value is much lower than that of the customers in segment 1. This is already a clear distinction and quite a useful segmentation. We will see in the next lesson that more insights can be uncovered by increasing the number of segments.

8. Let's practice running k-means clustering!

Now it's your turn to practice running the k-means clustering algorithm!