Log normalization

1. Log normalization

The first method we'll cover for standardization is log normalization.

2. What is log normalization?

Log normalization is a method for standardizing data that can be useful when we have features with high variance. Log normalization applies a logarithmic transformation to our values, which transforms them onto a scale that approximates normality - an assumption that many models make. The method of log normalization we're going to work with takes the natural log of each number; this is the exponent you would raise above the mathematical constant e (approximately equal to 2-point-718) to get that number.

3. What is log normalization?

Looking at the following table, the log of 30 is 3-point-4, because e to the power of 3-point-4 equals 30. Log normalization is a good strategy when you care about relative changes in a linear model, but still want to capture the magnitude of change, and when we want to keep everything in the positive space. It's a nice way to minimize the variance of a column and make it comparable to other columns for modeling.

4. Log normalization in Python

Applying log normalization to data in Python is fairly straightforward. We can use the log function from NumPy to do the transformation. Here we have a DataFrame of some values. If we check the variance of the columns, we can see that column 2 has a significantly higher variance than column 1, which makes it a clear candidate for log normalization. To apply log normalization to column 2, we need the log function from numpy. We can pass the column we want to log normalize directly into the function. If we take a look at both column 2 and the log-normalized column-2, we can see that the transformation has scaled down the values. If we check the variance of both column 1 and the log-normalized column 2, we can see that the variances are now much closer together.

5. Let's practice!

Now it's your turn! Let's take another look at the wine dataset and do some normalization.