1. Scaling and transformations
As mentioned in the last video, most machine learning algorithms require your data to be on the same scale to be effective.
2. Scaling data
For example, it is difficult to compare salary values (often measured in thousands) with ages, as shown here.
While this assumption of similar scales is necessary, it is rarely true in real-world data. For this reason, you need to rescale your data to ensure that it is on the same scale.
There are many different approaches to doing this, but we will discuss the two most commonly used approaches here: Min-Max scaling (sometimes referred to as normalization) and standardization.
3. Min-Max scaling
Min-Max scaling is when your data is scaled linearly between a minimum and maximum value, often 0 and 1, with 0 corresponding to the lowest value in the column and 1 to the largest. As it is a linear scaling, the values will change but the distribution will not. Take, for example, the Age column from the Stack Overflow dataset: the raw values lie approximately between 20 and 80.
4. Min-Max scaling
Here, after Min-Max scaling, although the distribution is the same, the values sit fully between 0 and 1.
5. Min-Max scaling in Python
To implement this on your dataset, you first need to import MinMaxScaler from scikit-learn's preprocessing module; scikit-learn is the most commonly used machine learning library for Python.
You then instantiate the MinMaxScaler() and fit it to your data. This tells the scaler how it should scale values when it performs the transformation. Finally, you need to actually transform the data with this new fitted scaler. Note that as this scaler assumes the maximum value it was fitted on is your upper bound, new data from outside this range may create unforeseen results, such as values above 1.
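The steps above can be sketched as follows. The Age column and its values are illustrative stand-ins, not the actual Stack Overflow dataset:

```python
# Minimal sketch of Min-Max scaling with scikit-learn.
# The "Age" values below are made up for illustration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Age": [21.0, 28.0, 35.0, 52.0, 79.0]})

scaler = MinMaxScaler()
scaler.fit(df[["Age"]])                 # learns the column's min and max
df["Age_scaled"] = scaler.transform(df[["Age"]])

# The smallest age maps to 0.0, the largest to 1.0,
# and every other value is placed linearly in between.
print(df)
```

Fitting and transforming are separate steps so that, later, the same fitted scaler can be applied to new data (for example, a test set) using the training data's minimum and maximum.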
6. Standardization
The other commonly used scaler is called standardization. As opposed to finding an outer boundary and squeezing everything within it, standardization instead finds the mean of your data and centers your distribution around it, calculating the number of standard deviations away from the mean each point is. These values (the number of standard deviations) are then used as your new values. This centers the data around 0 but technically has no limit to the maximum and minimum values as you can see here.
7. Standardization in Python
You can apply standardization in a similar fashion to how the Min-Max scaler was implemented. You first import StandardScaler from scikit-learn, instantiate it, and then fit it on your data. Once fitted, you can apply it to your data.
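A minimal sketch of the same pattern with StandardScaler; the salary values here are synthetic, not taken from the dataset:

```python
# Minimal sketch of standardization with scikit-learn.
# The salary values below are made up for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[30000.0], [45000.0], [60000.0], [120000.0]])

scaler = StandardScaler()
scaler.fit(salaries)              # learns the mean and standard deviation
scaled = scaler.transform(salaries)

# Each value is now its distance from the mean,
# measured in standard deviations.
print(scaled.ravel())
```

After the transformation the column has mean 0 and standard deviation 1, but unlike Min-Max scaling there is no fixed upper or lower bound.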
8. Log Transformation
Both standardization and Min-Max scaling are types of scaling; in other words, the data remains the same shape but is squashed or stretched. A log transformation, on the other hand, can be used to make highly skewed distributions less skewed. Take, for example, one of the salary columns from the Stack Overflow dataset shown here, where there is a very long right tail.
9. Log transformation in Python
Although it affects your data quite differently, a log transformation is implemented in Python the same way you implemented the scalers. To use a log transform, you first import PowerTransformer from sklearn's preprocessing module, fit it to your dataset, and once fitted, transform your data.
Log transformation is a type of power transformation, hence the name.
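A minimal sketch of this pattern. Note that PowerTransformer's default method is 'yeo-johnson', a power transformation that behaves similarly to a log transform on right-skewed data; the skewed salary data below is randomly generated for illustration:

```python
# Minimal sketch of a power transformation with scikit-learn.
# The right-skewed "salaries" below are synthetic, not real data.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
salaries = rng.lognormal(mean=11, sigma=0.8, size=(500, 1))  # long right tail

pt = PowerTransformer()                # default method is 'yeo-johnson'
pt.fit(salaries)                       # learns the transformation parameter
transformed = pt.transform(salaries)

# The transformed column is far less skewed and,
# because standardize=True by default, centered around 0.
print(transformed.mean(), transformed.std())
```

Because PowerTransformer standardizes its output by default, the transformed column is also centered on 0 with unit variance, combining the skew correction with the standardization covered earlier.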
10. Final Slide
Now it is your turn to apply these three techniques to the data you are familiar with and see what the transformed data looks like.