
Customer and product segmentation basics

1. Customer and product segmentation basics

Welcome to the last topic of this course - customer and product segmentation basics! Here we will learn how to build meaningful customer segments based on their product purchases.

2. Data format

The first step - same as with the other model types - is exploring the data. We will use a wholesale dataset of customer transactions. This is a customer-by-product purchase matrix, which holds purchase data for each customer at the product level. This is the standard way to approach customer segmentation at the product level.
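To make the format concrete, here is a minimal sketch of such a customer-by-product matrix. The customer IDs, product categories, and spend values are all hypothetical stand-ins, not the actual wholesale dataset:

```python
import pandas as pd

# Hypothetical example: three customers and their spend on three product
# categories. A real wholesale dataset has one row per customer and one
# column per product, with many zero entries.
data = pd.DataFrame(
    {"Fresh": [1200, 0, 450],
     "Milk": [300, 950, 0],
     "Grocery": [0, 120, 780]},
    index=["customer_1", "customer_2", "customer_3"],
)

print(data.shape)  # rows are customers, columns are products
```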

3. Unsupervised learning models

There are numerous unsupervised learning models, ranging from simple hierarchical clustering and K-means to more advanced ones like non-negative matrix factorization, biclustering, Gaussian mixture models, and many more.

4. Unsupervised learning models

We will use two models in this course. K-means is simple yet one of the most popular clustering methods, and a good place to start. We will also explore the non-negative matrix factorization model, which is applied in product recommendation engines, audio processing, computer vision, and even astronomy. It also works very well with sparse matrices, where most of the data points are zeros. As the name implies, the data has to be non-negative. In our case, the customer-by-product dataset is typically a sparse matrix with non-negative values only.
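A quick sketch of non-negative matrix factorization on a sparse, non-negative matrix, using scikit-learn's `NMF`. The simulated purchase matrix and the choice of three components are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import NMF

# Simulate a sparse, non-negative customer-by-product purchase matrix:
# 20 customers, 6 products, with many zero entries.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6)) * rng.integers(1, 50, size=(20, 6))

# Factorize X into customer scores (W) and product weights (H).
nmf = NMF(n_components=3, random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # (20, 3): each customer's score per component
H = nmf.components_        # (3, 6): each component's weight per product

# Both factors are non-negative, and W @ H approximates X.
print(W.shape, H.shape)
```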

5. Unsupervised learning steps

Let's reiterate the unsupervised learning modeling steps. First, as with supervised learning, we initialize the model. Then we fit the model. Once that is done, we assign the cluster values to the original dataset. Finally, we explore the differences between the clusters.
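The four steps above can be sketched with scikit-learn's `KMeans`. The dataset here is simulated purchase data standing in for the wholesale dataset, and the column names and cluster count are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical purchase data: 100 customers, 4 product columns.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.poisson(5, size=(100, 4)),
                  columns=["fresh", "milk", "grocery", "frozen"])

# 1. Initialize the model.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
# 2. Fit the model.
kmeans.fit(df)
# 3. Assign the cluster values to the original dataset.
df["cluster"] = kmeans.labels_
# 4. Explore the differences between clusters.
print(df.groupby("cluster").mean().round(1))
```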

6. Explore variables

Now, let's explore the variables. We will calculate the average and standard deviation for all columns and round the results. As we can see, they are quite different. Let's plot them as a barplot. We will first store the values in separate objects, then extract the column names and indices using numpy's arange function, which creates an array of indices equal to the number of columns. Then we will build the barplots by passing the indices first, offsetting them by 0.2 so the bars don't overlap, and passing the data, color, label, and bar width. Finally, we add the x-axis labels, rotate them by 90 degrees, and display the chart.
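The plotting steps just described could look like the sketch below. The data is simulated as a stand-in for the wholesale dataset, and the column names and colors are assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data standing in for the wholesale dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.exponential([100, 1000, 50], size=(50, 3)),
                  columns=["Frozen", "Fresh", "Detergents"])

# Store averages and standard deviations in separate objects, rounded.
averages = df.mean().round(2)
std_devs = df.std().round(2)
x_names = df.columns
x_ix = np.arange(df.shape[1])  # one index per column

# Offset the two sets of bars by 0.2 so they don't overlap.
plt.bar(x=x_ix - 0.2, height=averages, color="blue",
        label="Average", width=0.4)
plt.bar(x=x_ix + 0.2, height=std_devs, color="orange",
        label="Standard deviation", width=0.4)
plt.xticks(ticks=x_ix, labels=x_names, rotation=90)
plt.legend()
plt.show()
```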

7. Bar chart of averages and standard deviations

This is much better than just looking at the table. It's clear we have large differences in both average and standard deviation values across variables.

8. Visualize pairwise plot to explore distributions

Another great exploratory tool is the pairwise plot. We can build it with seaborn's `pairplot` function, passing the dataframe and setting the charts on the diagonal to kernel density estimates, abbreviated as `kde`, for each variable.

9. Pairwise plot review

We see that the estimated distributions on the diagonal are highly skewed, which means the variables are not normally distributed. We will explore options for adjusting these values in the next lesson.

10. Let's practice!

Great progress! Let's explore our data now!