Data preprocessing
1. Data preprocessing
Hi there! In this chapter we are going to learn about the key data pre-processing steps and considerations. We will then run k-means clustering on our RFM values and identify customer segments.
2. Advantages of k-means clustering
We are going to use k-means clustering because of its several advantages. First, it is one of the most popular unsupervised learning methods and has been researched extensively, so you can find answers and tips on almost any question about it. It is also a fast algorithm compared to alternatives that do not scale well to large datasets, especially on a local machine. And finally, it does its job well, as long as your assumptions about the data are correct. We will review those assumptions in this lesson, and then learn how to pre-process your data to get the most out of k-means.
3. Key k-means assumptions
Although this is not an exhaustive list, these are the make-it-or-break-it assumptions that are critical to address before clustering. The first assumption is that all variables have symmetrical distributions; by definition, this means the distribution is not skewed. We will see examples of both skewed and non-skewed distributions on the next slide. The second assumption is that all variables have the same average values. This is key to ensuring that each metric gets an equal weight in the k-means calculation, and you will learn how to bring every variable to the same mean. Finally, we will have to scale the variance of each variable to the same level. As with the averages, this helps the algorithm converge and ensures equal importance is assigned to each variable.
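As a rough sketch of how these checks could be done numerically with pandas, using a small hypothetical RFM table (the column names and values below are made up purely for illustration):

```python
import pandas as pd

# Hypothetical RFM table, for illustration only.
rfm = pd.DataFrame({
    'Recency': [10, 3, 45, 7, 120],
    'Frequency': [5, 12, 1, 8, 2],
    'MonetaryValue': [250.0, 900.5, 40.0, 310.0, 80.0]
})

# Skewness close to 0 suggests a roughly symmetrical distribution.
print(rfm.skew())

# The averages and standard deviations should be on comparable levels
# before running k-means.
print(rfm.agg(['mean', 'std']))
```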
4. Skewed variables
Let's take a look at how to identify skewed variables. The best way to identify them is by looking at each variable's distribution - the skewness will show up as a curve with a tail. For example, this is a left-skewed distribution with a tail on the left, and this here is a right-skewed distribution. While there are mathematical ways to identify skewness, visual analysis is more accessible: it allows you not only to identify skewness, but also to understand the distribution and the spread of the data.
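Continuing with the hypothetical `rfm` dataframe from the previous sketch, a per-variable distribution plot might look like this (seaborn and matplotlib are assumed to be available):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot each variable's distribution; a long tail on either side
# is a visual sign of skewness.
for column in ['Recency', 'Frequency', 'MonetaryValue']:
    sns.histplot(rfm[column], kde=True)
    plt.title(column)
    plt.show()
```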
5. Skewed variables
Skewness is best managed by applying a logarithmic transformation to each of the skewed variables; their distributions then become more symmetrical. One thing to point out, though - the log transformation works only on positive values. Fortunately, this is mostly the case with customer behavior and purchasing patterns. You will learn about alternative options for managing variables with negative values in the next lesson.
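A minimal sketch of the log transformation with NumPy, again using the hypothetical `rfm` dataframe, whose values are all positive:

```python
import numpy as np

# Apply a log transformation to reduce skewness.
# Note: this only works on strictly positive values.
rfm_log = np.log(rfm)

# The skewness should now be closer to zero.
print(rfm_log.skew())
```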
6. Variables on the same scale
Finally, k-means assumes each variable has the same average value and the same variance. We will use recency, frequency, and monetary values for the clustering, and in the next lesson you will see that they don't meet any of these criteria. We will then have to go through the necessary pre-processing steps before anything else. By calling the "describe" function on a dataframe, you will get a list of key statistics. Here we see that both the average values and the standard deviations differ between the three variables.
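Here is a sketch of both steps: reviewing the key statistics with `describe()`, then centering and scaling the log-transformed variables so they end up with the same mean and variance. Scikit-learn's `StandardScaler` is one common way to do this; the variable names are assumptions carried over from the earlier sketches:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Key statistics: note the different means and standard deviations.
print(rfm_log.describe())

# Center each variable around mean 0 and scale it to unit variance.
scaler = StandardScaler()
rfm_normalized = pd.DataFrame(
    scaler.fit_transform(rfm_log),
    index=rfm_log.index,
    columns=rfm_log.columns
)

# The means are now ~0 and the standard deviations ~1 for all variables.
print(rfm_normalized.describe().round(2))
```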
7. Let's review the concepts
Now, let's review the main concepts to solidify our understanding.