
Build customer and product segmentation

1. Build customer and product segmentation

Great! Now, we will build customer segments based on their product purchases.

2. Segmentation steps with K-means

Building segmentation with k-means is fairly easy. We import the KMeans module from the sklearn.cluster library. Then, we initialize a KMeans instance with a certain pre-defined number of clusters k. After that, we fit the model on the pre-processed dataset. Finally, we use the pandas assign method with the fitted model's labels_ attribute to create a new segment label column in the original un-processed dataset. Remember, it's important that we create this new column in the original dataset, not in the pre-processed one, as we will use the values in their original scale to analyze the segments in the next lesson.
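The steps above can be sketched as follows. This is a minimal sketch: the `wholesale` and `wholesale_scaled` dataframes are hypothetical stand-ins for the original and pre-processed datasets used in the course.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the original (un-processed) wholesale dataset
wholesale = pd.DataFrame({'Fresh': [3, 50, 4, 45],
                          'Milk': [40, 5, 38, 6]})

# Hypothetical stand-in for the pre-processed (here: standardized) dataset
wholesale_scaled = (wholesale - wholesale.mean()) / wholesale.std()

# Initialize KMeans with a pre-defined number of clusters k
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1)

# Fit the model on the pre-processed dataset
kmeans.fit(wholesale_scaled)

# Create the segment label column in the ORIGINAL dataset,
# keeping values in their original scale for later analysis
wholesale = wholesale.assign(segment=kmeans.labels_)
```

Note that the labels come from the model fitted on scaled data, but the new column is attached to the original dataframe, as the lesson stresses.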

3. Segmentation steps with NMF

The segmentation process with non-negative matrix factorization is almost as easy as with K-means. First, we import the NMF module, initialize it with a pre-defined number of components k, and fit it on the raw - not pre-processed - dataset. We then extract the components from the components_ attribute and store them as a pandas dataframe with column names from the original wholesale dataset. The components dataset is a segment-by-product matrix with segment weights. We will explore it later to understand the meaning of each segment and how the segments simplify the product purchase patterns into groups. We can get the cluster assignment by extracting the cluster weights for each customer: we call the transform method on the fitted nmf model and take the column names (or cluster labels) from the components object's index. Finally, we assign the customer index from the original wholesale dataset, and then create a new column in the original dataset by finding which cluster weight is the largest for each customer and choosing that label with the idxmax function.
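A minimal sketch of these NMF steps, assuming a small hypothetical `wholesale` dataframe in place of the real raw dataset:

```python
import pandas as pd
from sklearn.decomposition import NMF

# Hypothetical stand-in for the raw wholesale dataset (values must be non-negative)
wholesale = pd.DataFrame({'Fresh': [3, 50, 4, 45],
                          'Milk': [40, 5, 38, 6],
                          'Grocery': [45, 4, 40, 5]})

# Initialize NMF with a pre-defined number of components k, fit on the RAW data
nmf = NMF(n_components=2, max_iter=1000, random_state=1)
nmf.fit(wholesale)

# Segment-by-product matrix with segment weights
components = pd.DataFrame(nmf.components_, columns=wholesale.columns)

# Cluster weights for each customer, with cluster labels from the components index
# and the customer index from the original dataset
W = pd.DataFrame(nmf.transform(wholesale),
                 columns=components.index,
                 index=wholesale.index)

# Assign each customer the segment whose weight is largest
wholesale['segment'] = W.idxmax(axis=1)
```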

4. How to initialize the number of segments?

One thing we didn't cover previously is how to choose the number of clusters k. Both K-means and non-negative matrix factorization require that value to be set beforehand. There are two ways to define k - either mathematically or by testing different values and exploring the results. We will first try the mathematical approach by using the elbow criterion method. This will give us a ball-park estimate of the optimal number of clusters.

5. Elbow criterion method

The way it works is that we iterate through a number of k values, for example between 2 and 10. Then, we run KMeans clustering for each value on the same data. Within each run, we calculate the sum of squared errors to see how much it is reduced as we add more segments. Finally, we plot the sum of squared errors against k to identify the so-called elbow - the point where the decrease in errors slows down, and there's only incremental reduction with more segments past that point.

6. Calculate sum of squared errors and plot the results

Doing this with Python is straightforward. First, we initialize an empty dictionary called sse. Then, we loop through k values between 1 and 11 and store the sum of squared errors, which can be accessed via the inertia_ attribute of the fitted kmeans model. Finally, we plot the k values - the keys of the sse dictionary - on the x axis, and the sum of squared errors - the values of the sse dictionary - on the y axis.
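A sketch of this loop, using synthetic data with three well-separated groups as a stand-in for the pre-processed dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical stand-in for the pre-processed dataset: 30 customers
# drawn around three separated centers
rng = np.random.default_rng(1)
wholesale_scaled = pd.DataFrame(
    np.vstack([rng.normal(c, 0.1, size=(10, 2)) for c in (0, 2, 4)]),
    columns=['Fresh', 'Milk'])

# Empty dictionary to collect the sum of squared errors for each k
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1)
    kmeans.fit(wholesale_scaled)
    sse[k] = kmeans.inertia_  # sum of squared errors for this k

# Plot k (the dictionary keys) on the x axis
# and the sum of squared errors (the values) on the y axis
plt.plot(list(sse.keys()), list(sse.values()), marker='o')
plt.xlabel('k')
plt.ylabel('Sum of squared errors')
# plt.show() to display the elbow plot
```

With three true groups in the data, the curve drops sharply until k=3 and flattens after it, which is the elbow we look for.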

7. Identifying the optimal number of segments

As we can see, the elbow is somewhere between 2 and 3. When building the model, we should start with these k values incremented by 1, meaning that the optimal number of clusters mathematically is around 3 or 4.

8. Test & learn method

While the mathematical approach is a good start, the main goal of segmentation is to build a meaningful representation of the customer base as similar groups that can be interpreted and used in customized marketing or product offerings. So we will have to do some exploration. Typically, we first calculate the optimal number of segments mathematically. Then, we build the segmentation with multiple values around the optimal k. Finally, we explore the results and choose the one with the most business relevance. The rule of thumb for business relevance is asking yourself whether you could give each segment a name while looking at its characteristics. Also, we should check whether there's any ambiguity or overlap between the segments. We will explore this in more depth in the next lesson.

9. Let's build customer segments!

Fantastic, now let's go build some customer segments!