1. Customer and product segmentation basics
Welcome to the last topic of this course - customer and product segmentation basics! Here we will learn how to build meaningful customer segments based on their product purchases.
2. Data format
The first step - the same as with the other model types - is exploring the data.
We will use a wholesale dataset with customer transactions. This is a customer-by-product purchase matrix that holds purchase data for each customer at the product level. This is a standard way to approach customer segmentation at the product level.
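To make the format concrete, here is a minimal sketch of such a matrix, assuming a hypothetical pandas DataFrame named `wholesale` with made-up product categories and values:

```python
import pandas as pd

# Hypothetical customer-by-product purchase matrix: one row per customer,
# one column per product category, values are purchase amounts.
wholesale = pd.DataFrame({
    'Fresh':   [12669, 7057, 6353],
    'Milk':    [9656, 9810, 8808],
    'Grocery': [7561, 9568, 7684],
    'Frozen':  [214, 1762, 2405],
}, index=pd.Index([1, 2, 3], name='customer_id'))

print(wholesale.shape)  # 3 customers by 4 product categories
```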
3. Unsupervised learning models
There are numerous unsupervised learning models, starting from simple
hierarchical clustering, to
K-means,
then more advanced ones like non-negative matrix factorization or
biclustering,
Gaussian mixture models,
and many more.
4. Unsupervised learning models
We will use two models in this course.
K-means is simple yet one of the most popular clustering methods, and a good place to start.
We will also explore the non-negative matrix factorization model, which is applied in product recommendation engines, audio processing, computer vision, and even astronomy. It also works very well with sparse matrices, where most of the data points are zeros. As the name implies, the data has to be non-negative.
In our case, the customer-by-product dataset is typically a sparse matrix with only non-negative values.
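As a quick illustration of why NMF suits this kind of data, here is a sketch using scikit-learn's `NMF` on a synthetic sparse, non-negative matrix (the data is random and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic sparse, non-negative toy matrix:
# rows = customers, columns = products, most entries are zero.
rng = np.random.default_rng(42)
X = rng.poisson(0.5, size=(20, 6)).astype(float)

# Factorize X into W @ H with 3 latent components.
nmf = NMF(n_components=3, init='nndsvda', random_state=42, max_iter=500)
W = nmf.fit_transform(X)  # customer-by-component weights
H = nmf.components_       # component-by-product weights

# Both factors are non-negative, which keeps them interpretable.
assert (W >= 0).all() and (H >= 0).all()
```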
5. Unsupervised learning steps
Let's review the unsupervised learning modeling steps. First, as with supervised learning, we initialize the model. Then we fit the model. Once that is done, we assign the cluster values to the original dataset. Finally, we explore the differences between clusters.
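The four steps above can be sketched with K-means on a small synthetic customer-by-product table (column names and data are stand-ins, not the actual wholesale dataset):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data standing in for the wholesale dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.poisson(3, size=(30, 4)),
                  columns=['Fresh', 'Milk', 'Grocery', 'Frozen'])

# 1. Initialize the model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
# 2. Fit the model
kmeans.fit(df)
# 3. Assign the cluster values to the original dataset
df['cluster'] = kmeans.labels_
# 4. Explore the differences between clusters
print(df.groupby('cluster').mean())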
6. Explore variables
Now, let's explore the variables. We will calculate the average and standard deviation for all columns and round the results.
As we can see, they are quite different.
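A minimal sketch of this calculation, assuming a DataFrame named `wholesale` (synthetic data here, with skewed values like real purchase amounts):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wholesale dataset.
rng = np.random.default_rng(1)
wholesale = pd.DataFrame(rng.exponential(1000, size=(50, 3)),
                         columns=['Fresh', 'Milk', 'Grocery'])

# Average and standard deviation per column, rounded.
averages = wholesale.mean().round(2)
std_devs = wholesale.std().round(2)
print(averages)
print(std_devs)
```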
Let's plot them as a barplot. We will first store the values in separate objects, then extract the column names and indices using numpy's arange function, which creates an array of indices equal in length to the number of columns.
Then, we will visualize the barplots by passing the indices first, offsetting them by 0.2 so the bars don't overlap, then passing the data, color, label, and the width of the bars.
Finally, we add the x axis labels, rotate them by 90 degrees, and display the chart.
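These plotting steps can be sketched as follows, again on a synthetic `wholesale` DataFrame (the offsets and widths follow the description above; exact styling is a guess):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for the wholesale dataset.
rng = np.random.default_rng(2)
wholesale = pd.DataFrame(rng.exponential(1000, size=(50, 4)),
                         columns=['Fresh', 'Milk', 'Grocery', 'Frozen'])

# Store values in separate objects.
averages = wholesale.mean()
std_devs = wholesale.std()
x_names = wholesale.columns
x_ix = np.arange(wholesale.shape[1])  # one index per column

# Two bar series, offset by 0.2 in each direction so they don't overlap.
plt.bar(x=x_ix - 0.2, height=averages, color='blue',
        label='Average', width=0.4)
plt.bar(x=x_ix + 0.2, height=std_devs, color='orange',
        label='Standard Deviation', width=0.4)
plt.xticks(ticks=x_ix, labels=x_names, rotation=90)
plt.legend()
plt.show()
```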
7. Bar chart of averages and standard deviations
This is much better than just looking at the table. It's clear we have large differences in both average and standard deviation values across variables.
8. Visualize pairwise plot to explore distributions
Another great exploratory tool is the pairwise plot. We can call it using seaborn's `pairplot` function, where we pass the dataframe and make sure the charts on the diagonal are kernel density estimates, abbreviated as `kde`, for each variable.
9. Pairwise plot review
We see that the estimated distributions on the diagonal are highly skewed, which means the variables are not normally distributed. We will explore options for adjusting these values in the next lesson.
10. Let's practice!
Great progress! Let's explore our data now!