Scaling and transforming new data

1. Scaling and transforming new data

One of the most important aspects of machine learning is applying the model you create to a new dataset. For example, if you built a model on historical data, you will ultimately want to apply it to new data to make predictions.

2. Reuse training scalers

How you go about doing this depends on what transformations you applied to the dataset before fitting the model. For example, if you applied StandardScaler() to your data before fitting the model, you need to transform the test data using the same scaler before making predictions. Note that the scaler is fitted only on the training data: you fit and transform the training data, but only transform the test data.
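The fit-on-train, transform-only-on-test pattern can be sketched as follows with scikit-learn (the dataset here is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for a real dataset
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on train
X_test_scaled = scaler.transform(X_test)        # only transform on test

# The scaler's learned statistics come from the training data alone
print(scaler.mean_)
```

Because the scaler stores the training mean and standard deviation, the test set is shifted and scaled by exactly the same amounts as the training set.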

3. Training transformations for reuse

Similarly, if you intend to remove outliers from your test set, you should use the thresholds computed on your train set to do so. If you were instead to use the mean and standard deviation of the test set, the resulting thresholds could differ from those used during training and negatively impact your predictions. Note that it is only in very rare cases that you would want to remove outliers from your test set.
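As a sketch, the thresholds below use a mean ± 3 standard deviations rule computed from the training data only; the data and the 3-sigma cutoff are illustrative assumptions, not a prescribed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=5, size=1000)
# Test data with one extreme value appended
test = np.append(rng.normal(loc=50, scale=5, size=200), [200.0])

# Thresholds come from the TRAIN statistics, not the test statistics
lower = train.mean() - 3 * train.std()
upper = train.mean() + 3 * train.std()

# Apply the training-derived thresholds to the test set
test_filtered = test[(test > lower) & (test < upper)]
print(len(test), len(test_filtered))
```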

4. Why only use training data?

So why did we not refit the scaler on the test data, or use thresholds from the test data? To avoid data leakage. In real life, you won't have access to the test data: once your model is deployed in production, you won't have access to future data, so you can't rely on it when making predictions or assessing model performance.

5. Avoid data leakage!

Thus, you should always make sure you calibrate your preprocessing steps only on your training data or else you will overestimate the accuracy of your models.
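One convenient way to enforce this is scikit-learn's Pipeline, which refits preprocessing steps on each training fold during cross-validation so the held-out fold never influences the scaler. The data below is synthetic and illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Chaining the scaler and the model keeps preprocessing
# calibrated on training data only in every split
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```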
