
Centering and scaling data

1. Centering and scaling variables

Welcome to the next piece of data pre-processing for k-means. In this lesson, we will learn about the importance of centering and scaling variables.

2. Identifying an issue

The first thing to do is run diagnostics to identify whether there is actually an issue. We will analyze key statistics of the dataset and compare the mean and standard deviation of each variable by calling the describe() function on our RFM datamart. There we go - we can immediately see that each of our three variables has a different average value and a different standard deviation. In the next step, we will center the data so the average values match.
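A minimal sketch of this diagnostic step, assuming the RFM data lives in a pandas DataFrame; the variable name datamart_rfm and the sample values here are hypothetical, since the lesson prepares the data beforehand:

    import pandas as pd

    # Hypothetical RFM datamart; in the lesson this data is built in earlier steps.
    datamart_rfm = pd.DataFrame({
        'Recency': [37, 252, 96, 12, 400],
        'Frequency': [4, 1, 7, 15, 2],
        'MonetaryValue': [150.0, 29.9, 610.5, 1240.0, 48.3]
    })

    # Compare the mean and std rows to see the variables are on different scales.
    print(datamart_rfm.describe())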

3. Centering variables with different means

Given that k-means works best on variables with the same mean values, we will have to center them. Centering is a simple procedure: calculate the average of each variable and subtract it from each observation. In Python this is straightforward, and we can do it for all variables at once. We will store the result as datamart_centered and then review its key statistics. On a side note - we round the numbers, since after the subtraction the values have many decimals and the describe() output would otherwise appear in scientific notation. As you will see, the average values are now very close to zero for all three variables.
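Continuing with the hypothetical datamart_rfm from above, the centering step could look like this:

    # Subtract each variable's mean from every observation, all columns at once.
    datamart_centered = datamart_rfm - datamart_rfm.mean()

    # Round to avoid scientific notation; the mean row is now ~0 for each variable.
    print(datamart_centered.describe().round(2))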

4. Scaling variables with different variance

Now we will do the same for the variance, which - as we saw before - also differs between the variables. K-means works better on variables that have not only the same average value but also the same variance. The mechanism for scaling the variables to the same variance is as simple as centering: we just divide each observation by the standard deviation of its variable. We call the standard deviation function std() on the datamart and then divide the original dataset by this value. As before, we can do it for all variables at once. We store the result as the datamart_scaled dataframe and review whether this did the job. The results are clear: the standard deviation is now comparable across the transformed variables.
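A sketch of the scaling step, again assuming the hypothetical datamart_rfm:

    # Divide each observation by its variable's standard deviation.
    datamart_scaled = datamart_rfm / datamart_rfm.std()

    # The std row of describe() is now 1 for every variable.
    print(datamart_scaled.describe().round(2))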

5. Combining centering and scaling

Now, the last step is to combine the two. There are actually several options for normalizing variables. One is the approach we reviewed previously - manually subtracting the mean and dividing by the standard deviation. Another is to use a built-in class from the scikit-learn library called StandardScaler(). It returns a numpy.ndarray instead of a pandas dataframe, which is - in our case - an advantage, as k-means runs faster on this type of input. It works quite simply. You import StandardScaler() from sklearn.preprocessing and initialize it as scaler. Then you call the fit() function on the data and create a NumPy array datamart_normalized by applying the transform() function to the original data. Let's calculate the mean and standard deviation of the transformed data. Great! We have reached the same results with this library as with the manual transformations. Although the results are the same, using the StandardScaler() method from scikit-learn is beneficial when building machine learning pipelines.
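Putting it all together with StandardScaler, still using the hypothetical datamart_rfm:

    from sklearn.preprocessing import StandardScaler

    # Initialize the scaler, fit it on the data, then transform.
    scaler = StandardScaler()
    scaler.fit(datamart_rfm)
    datamart_normalized = scaler.transform(datamart_rfm)  # numpy.ndarray

    # Mean ~0 and standard deviation ~1 for each variable.
    print('mean:', datamart_normalized.mean(axis=0).round(2))
    print('std:', datamart_normalized.std(axis=0).round(2))

One small caveat worth knowing: StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas std() used above defaults to the sample standard deviation (ddof=1), so on small datasets the two approaches can differ slightly.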

6. Test different approaches by yourself!

Now it's your turn to practice centering and scaling different kinds of variables and dataframes!
