1. Centering and scaling
Data imputation is one of several important preprocessing steps for machine learning. Now let's cover another: centering and scaling our data.
2. Why scale our data?
Let's use df-dot-describe to check out the ranges of some of our feature variables in the music dataset.
We see that the ranges vary widely: duration_ms ranges from zero to one-point-six-two million, speechiness contains only small decimal values, and loudness has only negative values!
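As a rough sketch, that check might look like the following; the DataFrame name music_df and the file name "music_clean.csv" are assumptions for illustration, not necessarily what the course uses.

```python
# Hypothetical setup: load the music dataset into a pandas DataFrame.
# The name music_df and the file "music_clean.csv" are assumptions.
import pandas as pd

music_df = pd.read_csv("music_clean.csv")

# describe() summarizes each numeric column, making the very different
# ranges of duration_ms, speechiness, and loudness easy to spot.
print(music_df[["duration_ms", "speechiness", "loudness"]].describe())
```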
3. Why scale our data?
Many machine learning models use some form of distance to inform their predictions, so features on much larger scales than others can disproportionately influence our model.
For example, KNN uses distance explicitly when making predictions.
For this reason, we actually want features to be on a similar scale.
To achieve this, we can normalize or standardize our data, often referred to as scaling and centering.
4. How to scale our data
There are several ways to scale our data: given any column, we can subtract the mean and divide by the standard deviation so that all features are centered around zero and have a variance of one. This is called standardization.
We can also subtract the minimum and divide by the range of the data so the normalized dataset has minimum zero and maximum one. Or, we can center our data so that it ranges from -1 to 1 instead. In this video, we will perform standardization, but scikit-learn has functions available for other types of scaling.
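As a minimal NumPy sketch of both approaches (the values below are made up for illustration):

```python
import numpy as np

# Toy feature matrix: columns loosely mimic duration_ms, speechiness, loudness
X = np.array([[200000.0, 0.05,  -8.0],
              [350000.0, 0.30,  -3.5],
              [500000.0, 0.12, -12.0]])

# Standardization: subtract the mean and divide by the standard deviation,
# so each column has mean 0 and variance 1.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: subtract the minimum and divide by the range,
# so each column lies between 0 and 1.
X_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```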
5. Scaling in scikit-learn
To scale our features, we import StandardScaler from sklearn-dot-preprocessing.
We create our feature and target arrays.
Before scaling, we split our data to avoid data leakage.
We then instantiate a StandardScaler object, and call its fit_transform method, passing our training features.
Next, we use scaler-dot-transform on the test features.
Looking at the mean and standard deviation of the columns of both the original and scaled data verifies the change has taken place.
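Put together, those steps might look like the sketch below. It continues from the earlier music_df assumption and takes "genre" as a stand-in target column; the split parameters are also illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature and target arrays; 'genre' as the target column is an assumption
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values

# Split before scaling so no test-set information leaks into the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

# Mean and standard deviation before and after scaling
print(np.mean(X_train), np.std(X_train))
print(np.mean(X_train_scaled), np.std(X_train_scaled))  # roughly 0 and 1
```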
6. Scaling in a pipeline
We can also put a scaler in a pipeline! Here we build a pipeline object to scale our data and use a KNN model with six neighbors.
We then split our data, fit the pipeline to our training set, and predict on our test set.
Computing the accuracy yields a result of zero-point-eight-one. Let's compare this to using unscaled data.
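Before moving on, here is a sketch of the scaled pipeline described above, continuing with the X and y arrays from the earlier sketch:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then classify with a 6-neighbor KNN
steps = [("scaler", StandardScaler()),
         ("knn", KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))  # accuracy with scaling
```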
7. Comparing performance using unscaled data
Here we fit a KNN model to our unscaled training data and print the accuracy.
It is only zero-point-five-three, so just by scaling our data we improved accuracy by over 50 percent relative to the unscaled model!
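For comparison, a bare KNN model on the unscaled arrays from the same split might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same training and test split, but no scaling step;
# n_neighbors=6 is assumed to match the pipeline above
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test))  # accuracy without scaling
```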
8. CV and scaling in a pipeline
Let's also look at how we can use cross-validation with a pipeline. We first build our pipeline.
We then specify our hyperparameter space by creating a dictionary: the keys are the pipeline step name followed by a double underscore, followed by the hyperparameter name. The corresponding value is a list or an array of the values to try for that particular hyperparameter. In this case, we are tuning n_neighbors in the KNN model.
Next we split our data into training and test sets.
We then perform a grid search over our parameters by instantiating the GridSearchCV object, passing our pipeline and setting the param_grid argument equal to parameters.
We then fit it to our training data.
Lastly, we make predictions using our test set.
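As a sketch, tuning n_neighbors over the scaled pipeline could look like this; the range of neighbor values is an assumption, not the course's exact grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

steps = [("scaler", StandardScaler()),
         ("knn", KNeighborsClassifier())]
pipeline = Pipeline(steps)

# Keys follow the "stepname__hyperparameter" convention
parameters = {"knn__n_neighbors": np.arange(1, 50)}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)  # predictions from the best model found
```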
9. Checking model parameters
Printing GridSearchCV's best_score_ attribute, we see the score is very slightly better than our previous model's performance.
Printing the best parameters, the optimal model has 12 neighbors.
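Continuing the sketch above, these are the attributes we would inspect:

```python
print(cv.best_score_)   # best mean cross-validation accuracy
print(cv.best_params_)  # e.g. {'knn__n_neighbors': 12}
```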
10. Let's practice!
Now let's incorporate scaling into our supervised learning workflow!