
Data preparation for segmentation

1. Data preparation for segmentation

Great job! The next step is to prepare the data for segmentation. Different models, especially k-means, have certain expectations for the input data.

2. Model assumptions

We will start by building a segmentation with K-means clustering. This method discovers segments well only when the input variables are roughly normally distributed, i.e. without the skewness we observed in the pairwise plot. K-means also expects the data to be standardized, with a mean of 0 and a standard deviation of 1. Our second model of interest, non-negative matrix factorization, works well on raw sparse matrices and does not make such strong assumptions.

3. Unskewing data with log-transformation

Now, let's try to unskew our variables. The first step is to run a logarithmic transformation of the variables. We call NumPy's log function on the dataset and store the result in a separate object. Let's call the pairwise plot function on the new dataset to see whether we reduced the skewness.
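The step above can be sketched as follows. This is a minimal example using a small made-up wholesale-style DataFrame in place of the course dataset; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wholesale-style data with right-skewed spend columns
wholesale = pd.DataFrame({
    "Fresh": [112.0, 5.0, 300.0, 14.0, 2500.0, 60.0],
    "Milk": [9.0, 110.0, 3.0, 45.0, 700.0, 12.0],
})

# Element-wise natural log transformation to reduce skewness;
# this requires all values to be strictly positive
wholesale_log = np.log(wholesale)

# A pairwise plot, e.g. seaborn.pairplot(wholesale_log),
# would show more bell-shaped distributions than the raw data
print(wholesale_log.round(2))
```

Note that the log transform only works on strictly positive values, which holds for spend data like this.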

4. Explore log-transformed data

We can see that the variables are less skewed and look more bell-shaped, although the Fresh variable is still slightly skewed to the left. Let's explore another transformation type.

5. Unskewing data with Box-Cox transformation

Box-Cox is a slightly more advanced transformation method. The `scipy` package provides a `boxcox` function in its `stats` module. Since it returns two objects, we will define a simple helper function to make our lives easier, so we can apply it directly to a dataframe. We define a `boxcox_df` function whose only parameter is `x`; it calls the `boxcox` function on `x` and unpacks the result into two objects. The underscore means we discard the second object, as we only care about the first one. The function then returns it as the output. We can now apply this function to the wholesale dataset and store the transformed results as `wholesale_boxcox`. Let's see if we fixed the skewness in the Fresh variable.
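The helper described above can be sketched like this, again using a small made-up wholesale-style DataFrame as a stand-in for the course dataset:

```python
import pandas as pd
from scipy import stats

def boxcox_df(x):
    # stats.boxcox returns (transformed values, fitted lambda);
    # the underscore discards the lambda, as only the values are needed
    x_boxcox, _ = stats.boxcox(x)
    return x_boxcox

# Hypothetical strictly positive wholesale-style data
wholesale = pd.DataFrame({
    "Fresh": [112.0, 5.0, 300.0, 14.0, 2500.0, 60.0],
    "Milk": [9.0, 110.0, 3.0, 45.0, 700.0, 12.0],
})

# apply() calls boxcox_df on each column separately
wholesale_boxcox = wholesale.apply(boxcox_df, axis=0)
print(wholesale_boxcox.round(2))
```

Like the log transform, Box-Cox requires strictly positive input values.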

6. Explore Box-Cox transformed data

We did it! The top-left variable is also bell-shaped. We can now move to the final data pre-processing step.

7. Scale the data

The K-means method expects scaled data; otherwise it will not find well-separated clusters. Scaling works by first subtracting the column average from each individual entry, which adjusts the column average to zero. Then we divide the result by the column's standard deviation, which makes the column's standard deviation equal to 1. Let's test this approach with the StandardScaler class from sklearn. We import it from the sklearn.preprocessing module and initialize it as scaler. Then we fit it on the dataset, transform the data, and store the result as wholesale_scaled. The transformed dataset is a NumPy array, so we create a pandas dataframe by passing the transformed data along with the index and column names from the Box-Cox-transformed dataframe. Finally, we calculate the averages and standard deviations to confirm the scaler did its job: all averages are close to zero, and standard deviations are close to 1. The negative zeros are not a mistake; the actual value may be some tiny negative or positive decimal that becomes invisible when rounded, yet the sign remains.
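The scaling workflow above can be sketched as follows. The input here is a small made-up stand-in for the Box-Cox-transformed wholesale data; the values are assumptions for illustration. Note that StandardScaler uses the population standard deviation (ddof=0), so we check the result with NumPy's matching std rather than pandas' sample std.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical Box-Cox-transformed wholesale-style data
wholesale_boxcox = pd.DataFrame({
    "Fresh": [4.7, 1.6, 5.7, 2.6, 7.8, 4.1],
    "Milk": [2.2, 4.7, 1.1, 3.8, 6.6, 2.5],
})

# Fit learns each column's mean and standard deviation;
# transform subtracts the mean and divides by the std
scaler = StandardScaler()
scaler.fit(wholesale_boxcox)
wholesale_scaled = scaler.transform(wholesale_boxcox)

# transform() returns a NumPy array, so rebuild a DataFrame
# with the original index and column names
wholesale_scaled_df = pd.DataFrame(
    wholesale_scaled,
    index=wholesale_boxcox.index,
    columns=wholesale_boxcox.columns,
)

# Column means should be ~0 and population stds ~1
print(wholesale_scaled_df.mean().round(2))
print(np.std(wholesale_scaled, axis=0).round(2))
```

Rounding the means can produce the negative zeros mentioned above: a value like -1e-17 rounds to 0.0 but keeps its sign.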

8. Let's practice!

Great work! Let's test our knowledge in pre-processing data for segmentation.