
Scaling data for feature comparison

1. Scaling data

Let's move on to talking about scaling our data.

2. What is feature scaling?

Scaling, specifically standardization, is most useful when we're working with a dataset that contains continuous features on different scales, and we're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors). Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. This makes it easier to compare features linearly, which many models in scikit-learn require.
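As a quick sketch of the math behind this (my own illustration, not from the slides): standardization subtracts the column mean from each value and divides by the column standard deviation, so z = (x - mean) / std.

```python
# Hypothetical example: z-score standardization by hand.
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # a feature on a large scale

# Subtract the mean and divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0 (up to floating-point rounding)
print(z.var())   # exactly 1.0
```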

3. How to scale data

Let's take a look at another DataFrame. In each column, we have numbers with consistent scales within columns, but not across columns. If we look at the variance, it differs widely from column to column. To model this data well, scaling would be a good choice here.
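For instance, a toy DataFrame along these lines (the column names and values are my own invention, not the course's dataset) makes the scale mismatch visible:

```python
import pandas as pd

# Hypothetical DataFrame: values are consistent within each column,
# but the columns sit on very different scales.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [10.0, 20.0, 30.0, 40.0],
    "col3": [1000.0, 2000.0, 3000.0, 4000.0],
})

# The per-column variance differs by orders of magnitude.
print(df.var())
# col1 ~1.67, col2 ~167, col3 ~1.67e6
```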

4. How to scale data

scikit-learn has a variety of scaling methods, but we'll focus on StandardScaler, which is imported from sklearn-dot-preprocessing. This method works by subtracting the mean and scaling each feature to have a variance of one. Once we instantiate a StandardScaler, we can apply the fit_transform method on the DataFrame. We can convert the output of fit_transform, which is a numpy array, to a DataFrame to look at it more easily. If we take a look at the newly scaled DataFrame, we can see that the values have been scaled down, and if we calculate the variance by column, it's not only close to 1, but it's now the same for all of our features.
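Putting the steps from this slide together, here is a minimal sketch (reusing the hypothetical toy DataFrame from above; variable names are illustrative):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Hypothetical DataFrame on mismatched scales, as before.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [10.0, 20.0, 30.0, 40.0],
    "col3": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Instantiate the scaler and fit/transform in one step.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)  # returns a numpy array

# Convert back to a DataFrame to look at it more easily.
df_scaled = pd.DataFrame(scaled, columns=df.columns)

# Every column now has mean ~0 and the same variance
# (close to 1; pandas' .var() uses ddof=1, so it sits slightly above 1).
print(df_scaled.mean())
print(df_scaled.var())
```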

5. Let's practice!

Now it's your turn to try scaling data using scikit-learn.