1. Sequence of structuring pre-processing steps
In this lesson, I want to focus on the importance of sequence when structuring the pre-processing steps.
2. Why does the sequence matter?
First, we have to discuss why the sequence of the transformations matters at all.
The log transformation only works with strictly positive data. That is fine for raw customer behavior and purchasing values, since these are almost always positive.
However, the centering and scaling process forces each variable to have a mean of 0 and a standard deviation of 1, which almost always introduces negative values unless all observations are identical. Running the log transformation after centering and scaling would therefore fail, so it has to come first.
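A minimal sketch can make this concrete. Assuming a small, made-up array of purchase amounts, it shows that StandardScaler pushes values below the mean into negative territory, where np.log() produces NaN:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical, strictly positive purchase amounts (illustration only)
spend = np.array([[10.0], [50.0], [200.0], [900.0]])

# Centering and scaling forces mean 0 / std 1,
# so values below the mean become negative
scaled = StandardScaler().fit_transform(spend)
print(scaled.ravel())

# np.log() of a negative number is NaN, so scaling
# before the log transformation breaks the pipeline
with np.errstate(invalid="ignore"):
    print(np.log(scaled.ravel()))
```

This is why the log transformation must run before centering and scaling, not after.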
3. Sequence
Therefore, the sequence of the pre-processing pipelines should be structured in a specific order.
First, we have to unskew the data with a log transformation.
Only then do we center the variables to the same mean
and scale them to the same standard deviation.
Finally, we store the results as a separate object from the original dataset. This is a critical point: after we're done with clustering, we will come back to the original dataset to calculate statistics for each cluster based on the raw values.
4. Coding the sequence
First, we run the log transformation by applying the log() function from the NumPy library, and store the result as datamart_log to keep it separate from the raw values.
Then we run the centering and scaling by using the StandardScaler() from the scikit-learn library.
Finally, we store the results of the normalized data into a different object called datamart_normalized.
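The three coding steps above can be sketched end to end. The column names and values below are made-up RFM-style data for illustration; the object names follow the ones used in the lesson:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical datamart with strictly positive RFM-style values
datamart = pd.DataFrame({
    "Recency":       [3, 40, 180, 12],
    "Frequency":     [45, 8, 1, 20],
    "MonetaryValue": [1200.0, 150.0, 20.0, 600.0],
})

# Step 1: unskew the data with a log transformation
datamart_log = np.log(datamart)

# Step 2: center and scale to mean 0, standard deviation 1
scaler = StandardScaler()
scaler.fit(datamart_log)

# Step 3: store the normalized data as a separate object,
# leaving the raw datamart untouched for later cluster statistics
datamart_normalized = scaler.transform(datamart_log)

print(datamart_normalized.mean(axis=0))  # each column's mean is ~0
print(datamart_normalized.std(axis=0))   # each column's std is ~1
```

Keeping datamart, datamart_log, and datamart_normalized as separate objects means the raw values stay available for summarizing each cluster after k-means.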
This is it! We are now ready to run k-means clustering and identify valuable customer segments.
5. Practice on RFM data!
Congratulations! You will now revisit the key concepts and apply the pre-processing pipeline to the RFM data.