Scaling and transforming new data

1. Scaling and transforming new data

One of the most important aspects of machine learning is applying the model you create to a new dataset. For example, if you built a model on historical data, you will ultimately want to apply it to new data to make predictions.

2. Reuse training scalers

How you go about doing this depends on what transformations you applied to the dataset before fitting the model. For example, if you applied StandardScaler() to your data before fitting the model, you need to transform the test data using the same scaler before making predictions. Note that the scaler is fitted only on the training data: you fit and transform the training data, but only transform the test data.
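The fit-on-train, transform-only-on-test pattern can be sketched as follows with scikit-learn (the dataset here is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for a real dataset
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform on train
X_test_scaled = scaler.transform(X_test)        # only transform on test

# The scaler's learned statistics come from the training data alone
print(scaler.mean_)
```

Because the scaler stores the training mean and standard deviation, the test set is shifted and scaled by exactly the same amounts as the training set.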

3. Training transformations for reuse

Similarly, if you intend to remove outliers from your test set, you should use the thresholds computed on your train set to do so. If you were instead to use the mean and standard deviation of the test set, the resulting thresholds could differ from those used during training and negatively impact your predictions. Note that it is only in very rare cases that you would want to remove outliers from your test set.
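As a sketch, the thresholds below use a mean ± 3 standard deviations rule computed from the training data only; the data and the 3-sigma cutoff are illustrative assumptions, not a prescribed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=5, size=1000)
# Test data with one extreme value appended
test = np.append(rng.normal(loc=50, scale=5, size=200), [200.0])

# Thresholds come from the TRAIN statistics, not the test statistics
lower = train.mean() - 3 * train.std()
upper = train.mean() + 3 * train.std()

# Apply the training-derived thresholds to the test set
test_filtered = test[(test > lower) & (test < upper)]
print(len(test), len(test_filtered))
```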

4. Why only use training data?

So why did we not refit the scaler on the test data, or use thresholds from the test data? To avoid data leakage. In real life, you won't have access to the test data: once your model is deployed in production, you won't have access to future data, so you can't rely on it when making predictions or assessing model performance.

5. Avoid data leakage!

Thus, you should always make sure you calibrate your preprocessing steps only on your training data or else you will overestimate the accuracy of your models.
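One convenient way to enforce this is scikit-learn's Pipeline, which refits preprocessing steps on each training fold during cross-validation so the held-out fold never influences the scaler. The data below is synthetic and illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Chaining the scaler and the model keeps preprocessing
# calibrated on training data only in every split
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```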
