
Log transformation

In the previous exercises you scaled the data linearly, which does not change the shape of its distribution. This works well if your data is normally distributed (or close to it), an assumption that many machine learning models make. Sometimes you will work with data that closely conforms to normality, e.g. the height or weight of a population. On the other hand, many real-world variables do not follow this pattern, e.g. the wages or ages of a population. In this exercise you will use a log transform on the ConvertedSalary column of the so_numeric_df DataFrame, because most of its values are concentrated at the lower end while a few are very high. Such distributions are said to have a long right tail.
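
To make the point about shape concrete, here is a minimal sketch using NumPy only. The data is synthetic (a log-normal sample standing in for a salary-like column, which is an assumption and not the course's dataset), and `skewness` is a hypothetical helper written for this illustration. It compares the skewness of the raw values, a min-max scaled copy, and a log-transformed copy.

```python
import numpy as np


def skewness(values):
    """Sample skewness: the mean cubed z-score (near 0 for symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Synthetic, right-skewed stand-in for a salary-like column (assumption).
rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)

# Linear (min-max) scaling shifts and rescales the values but keeps the shape.
scaled = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# A log transform compresses large values and pulls in the right tail.
logged = np.log(salaries)

skew_raw = skewness(salaries)
skew_scaled = skewness(scaled)
skew_logged = skewness(logged)

print("raw skewness:    ", skew_raw)
print("scaled skewness: ", skew_scaled)
print("logged skewness: ", skew_logged)
```

The raw and scaled skewness values should agree up to floating-point error, while the logged values should be far closer to symmetric.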

This exercise is part of the course

Feature Engineering for Machine Learning in Python


Instructions

  • Import PowerTransformer from sklearn's preprocessing module.
  • Instantiate the PowerTransformer() as pow_trans.
  • Fit the PowerTransformer on the ConvertedSalary column of so_numeric_df.
  • Transform the same column with the transformer you just fit.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Import PowerTransformer
from sklearn.preprocessing import ____

# Instantiate PowerTransformer
pow_trans = ____

# Train the transform on the data
____

# Apply the power transform to the data
so_numeric_df['ConvertedSalary_LG'] = ____(so_numeric_df[['ConvertedSalary']])

# Plot the data before and after the transformation
so_numeric_df[['ConvertedSalary', 'ConvertedSalary_LG']].hist()
plt.show()
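
For reference, the fit-then-transform pattern from the instructions can be sketched on stand-in data. This is an illustration under stated assumptions, not the exercise's official solution: `demo_df` is a hypothetical DataFrame filled with synthetic log-normal values, since `so_numeric_df` is only available in the course environment. Note that `PowerTransformer()` with default arguments applies the Yeo-Johnson power transform and then standardizes the result, so it is a log-like transform and not a literal logarithm.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical stand-in for so_numeric_df: synthetic right-skewed values.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame(
    {"ConvertedSalary": rng.lognormal(mean=11.0, sigma=1.0, size=1_000)}
)

# Instantiate the transformer (defaults: Yeo-Johnson, standardized output).
pow_trans = PowerTransformer()

# Learn the power parameter from the column; double brackets keep it 2-D.
pow_trans.fit(demo_df[["ConvertedSalary"]])

# Apply the fitted transform; ravel() flattens the (n, 1) result to 1-D.
transformed = pow_trans.transform(demo_df[["ConvertedSalary"]])
demo_df["ConvertedSalary_LG"] = transformed.ravel()

# Compare skewness before and after the transformation.
print(demo_df[["ConvertedSalary", "ConvertedSalary_LG"]].skew())
print("fitted lambda:", pow_trans.lambdas_)
```

The transformed column should be far less skewed than the original, and calling `.hist()` on both columns, as in the exercise, shows the same thing visually.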