1. Managing skewed variables
Good job! Now we will dive deep into techniques to manage skewed variables.
2. Identifying skewness
In the previous lesson, we discussed that the easiest way to identify skewness is by visually analyzing the distribution of each variable.
If a distribution has a long tail on one side, like the one shown here, the data is skewed.
3. Exploring distribution of recency
Let's explore the distribution of our recency variable.
We will first import the seaborn library and the pyplot module from matplotlib.
Then we will use distplot() on the Recency variable to plot its distribution.
As you can see, the Recency metric has a tail on the right, so we can conclude that it is skewed.
Let's try the same thing on Frequency.
4. Exploring distribution of frequency
We already have the libraries imported so we'll just plot the distribution for frequency directly.
Frequency is even more skewed: the tail is again on the right, with most observations between zero and roughly one hundred, while some values reach as high as fourteen hundred.
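A visual check like this can be backed up with a few numbers. The sketch below is one possible way to do that, assuming a pandas Series named `frequency` filled with synthetic values (both the name and the data are illustrative). A mean well above the median and a positive sample skewness both point to a right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical frequency values: synthetic and right-skewed, not the course data.
rng = np.random.default_rng(seed=1)
frequency = pd.Series(
    rng.lognormal(mean=3.0, sigma=1.0, size=2000).round() + 1,
    name="Frequency",
)

# Summary statistics, including upper percentiles that expose the long tail.
summary = frequency.describe(percentiles=[0.5, 0.9, 0.99])
# Sample skewness: positive values indicate a tail on the right.
skewness = frequency.skew()

print(summary)
print(f"sample skewness: {skewness:.2f}")
```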
Let's see if we can deal with the skewness and make the distribution more symmetrical.
5. Data transformations to manage skewness
The easiest way to unskew the data is to apply a logarithmic transformation, but this only works on positive values. There are other approaches, such as the Box-Cox transformation, but we will use the simpler log transformation for this example.
We first import the NumPy library as np and then apply the log() function on the frequency variable. We store it as frequency_log.
Finally, we plot it like in the previous slides.
There we go! Although it's not perfectly symmetrical, it has very little skewness compared to the original distribution.
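The transformation step might look roughly like the sketch below. The `frequency` Series holds synthetic, strictly positive values chosen for illustration, and the skewness comparison is an added check rather than part of the lesson's plot.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive frequency values (synthetic, for illustration).
rng = np.random.default_rng(seed=2)
frequency = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000).round() + 1)

# Natural log of every observation; valid only because all values are > 0.
frequency_log = np.log(frequency)

# Compare sample skewness before and after the transformation.
print(f"before: {frequency.skew():.2f}, after: {frequency_log.skew():.2f}")
```

The transformed values can then be plotted in the same way as the original variable.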
6. Dealing with negative values
As pointed out earlier, one important thing to remember about the logarithmic transformation is that it only works on positive values. Although customer behavioral or purchase data is almost always positive, there are some techniques to manage negative values.
The simplest way to manage negative values is to add a constant to every observation of the variable. The choice of constant is somewhat arbitrary, but the best practice is to add the absolute value of the lowest negative value to each observation, plus a small constant like 1, so that all values are strictly positive.
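Here is a minimal sketch of that shift. The helper name `shift_to_positive` and the sample values are made up for illustration. The shift is only applied when the lowest value is zero or negative.

```python
import numpy as np

def shift_to_positive(values, offset=1.0):
    """Hypothetical helper: if the minimum is <= 0, shift so it equals `offset`."""
    lowest = values.min()
    # Add |lowest| plus a small constant; leave already-positive data unchanged.
    shift = abs(lowest) + offset if lowest <= 0 else 0.0
    return values + shift

values = np.array([-5.0, -1.0, 0.0, 3.0, 20.0])  # made-up observations
shifted = shift_to_positive(values)               # lowest value becomes 1.0
shifted_log = np.log(shifted)                     # log is now defined everywhere
```

The same shift has to be applied to any new data before the log transformation, so it is worth storing the constant that was used.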
Another option is to choose a different transformation method. Calculating a cube root works quite well in some cases.
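As a small illustration of the cube root option, the sketch below applies NumPy's `cbrt()` to a few made-up values. Unlike the logarithm, the cube root is defined for negative numbers and zero, so no shifting is needed.

```python
import numpy as np

# Made-up observations that include negative values and zero.
values = np.array([-27.0, -1.0, 0.0, 8.0, 1000.0])

# np.cbrt keeps the sign of each value; raising to the power 1/3 instead
# would produce NaN for the negative entries.
values_cbrt = np.cbrt(values)

print(values_cbrt)  # approximately [-3, -1, 0, 2, 10]
```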
While these are useful tips, the fortunate thing about customer behavior data is that it is almost always positive, so we usually don't have to worry about this.
7. Let's practice how to identify and manage skewed variables!
Great job! Now it's your turn to identify skewed variables and apply transformations to un-skew them!