Scaling data for machine learning

1. Scaling data for machine learning

Now that you have split the data into train and test datasets, we should check the model's performance.

2. Evaluate the model

For this, we will first fit our model to the train data. We then use the model's .score() method, which returns the fraction of correct predictions, also called accuracy. The output is 0.78, so our model predicted the correct weather condition 78% of the time. We should be able to improve this.
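The fit-and-score step might look roughly like the sketch below. The transcript does not name the model or show the dataset, so the KNeighborsClassifier, the synthetic weather values, and the labelling rule are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the weather data (assumption: the real dataset is not shown).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "humidity": rng.uniform(20, 100, 200),
    "temperature": rng.uniform(-5, 35, 200),
    "pressure": rng.uniform(980, 1040, 200),
})
# Hypothetical labelling rule: "rain" when it is humid and cool, otherwise "clear".
y = np.where(X["humidity"] - X["temperature"] > 50, "rain", "clear")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumption: the transcript does not say which classifier is used.
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# For classifiers, .score() returns accuracy as a fraction between 0 and 1.
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```

Multiplying the returned fraction by 100 gives the percentage quoted in the lesson.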

3. Scaling

Before applying machine learning algorithms, it is recommended to scale the data so that all columns are in a similar value range. If the data is not scaled, some algorithms may behave badly or let the feature with the largest values dominate all others. We will use the StandardScaler from scikit-learn for this task. StandardScaler standardizes each column by subtracting its mean and dividing by its standard deviation, so that every column ends up with unit variance. With this, all series end up centered around 0, as we can see in the image.
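To make the standardization concrete, here is a small sketch of the same calculation done by hand with NumPy on a single column. The pressure values are made up for illustration.

```python
import numpy as np

# Made-up pressure readings, for illustration only.
pressure = np.array([990.0, 1000.0, 1010.0, 1020.0])

mean = pressure.mean()
std = pressure.std()  # population standard deviation (ddof=0), the same convention StandardScaler uses

# Standardize: subtract the mean, then divide by the standard deviation.
scaled = (pressure - mean) / std

print(scaled)
print("mean after scaling:", scaled.mean())
print("std after scaling:", scaled.std())
```

After this step the column has a mean of 0 and a standard deviation of 1, up to floating-point rounding.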

4. Unscaled data

Let's remember how the data looks. We have 3 columns, also called features in a machine learning context: humidity, temperature and pressure. Each of the features has a different range.

5. StandardScaler

We import StandardScaler from sklearn.preprocessing. We then instantiate StandardScaler() and assign it to the variable sc. Using .fit() on the scaler object, we fit the scaler to our data. This step lets the scaler find the parameters it needs to scale the data properly. In the case of StandardScaler, these parameters are the mean and the variance. They are calculated per feature and are available from the scaler object as sc.mean_ and sc.var_. We then use .transform() to scale the data. Note that the output will be a NumPy array, even though we passed in a DataFrame.
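The steps above can be sketched as follows. The tiny DataFrame is made up for illustration and is not the course dataset; the variable name sc follows the transcript.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample with the three features mentioned in the lesson.
df = pd.DataFrame({
    "humidity": [60.0, 70.0, 80.0, 90.0],
    "temperature": [10.0, 15.0, 20.0, 25.0],
    "pressure": [990.0, 1000.0, 1010.0, 1020.0],
})

sc = StandardScaler()
sc.fit(df)  # learns the per-feature mean and variance

print("means:", sc.mean_)
print("variances:", sc.var_)

scaled = sc.transform(df)  # applies (x - mean) / sqrt(variance) per column
print(type(scaled))        # a NumPy array with scikit-learn's default output settings
```

Calling sc.fit_transform(df) would combine the two steps into one call.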

6. StandardScaler

We can convert this back to a DataFrame by using the original index and the original column headers to create a scaled version of the DataFrame. Remember that the first few lines of the original DataFrame had a pressure of around 1000 and a humidity of around 70. Now all values are centered around 0 and have been scaled to unit variance. We see the humidity at 0.47, so it's about half a standard deviation above the average humidity. Temperature is at -0.52, so it's lower than the average temperature. The same goes for pressure, with a value of -0.65.
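A minimal sketch of rebuilding the DataFrame from the scaled array is shown below. The data and the date index are invented for illustration; only the pattern of reusing index and columns comes from the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample with a date index, so that keeping the index is visible.
df = pd.DataFrame(
    {
        "humidity": [60.0, 70.0, 80.0, 90.0],
        "temperature": [10.0, 15.0, 20.0, 25.0],
        "pressure": [990.0, 1000.0, 1010.0, 1020.0],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

sc = StandardScaler()
scaled = sc.fit_transform(df)  # NumPy array: index and column names are lost

# Reattach the original index and column headers.
df_scaled = pd.DataFrame(scaled, index=df.index, columns=df.columns)
print(df_scaled.head())
```

Each value in df_scaled can be read as the number of standard deviations that observation lies above or below its column's mean.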

7. Evaluate the model

Let's score the model again, this time on the scaled data. Performance improved to 88%, so it's now possible to correctly predict the weather condition 88% of the time.
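Re-scoring on scaled data could look something like this sketch. As before, the classifier and the synthetic data are assumptions, so the accuracy printed here will not match the 88% from the lesson and is not guaranteed to beat the unscaled score. Fitting the scaler on the train split only, and then reusing it on the test split, is a common practice that the transcript does not spell out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data (assumption: the real dataset is not shown).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "humidity": rng.uniform(20, 100, 200),
    "temperature": rng.uniform(-5, 35, 200),
    "pressure": rng.uniform(980, 1040, 200),
})
# Hypothetical labelling rule, for illustration only.
y = np.where(X["humidity"] - X["temperature"] > 50, "rain", "clear")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the scaling parameters from the train split, then apply them to both splits.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Assumption: the transcript does not say which classifier is used.
model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)

scaled_accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy on scaled data: {scaled_accuracy:.2f}")
```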

8. Let's practice!

And now it's your turn to scale some data.
