Get Started

Standardization

1. Standardization

It's possible that you'll come across datasets with lots of numerical noise, perhaps due to feature variance or differently-scaled data. The preprocessing solution for that is standardization.

2. What is standardization?

Standardization is a preprocessing method used to transform continuous data to make it look normally distributed. In scikit-learn, this is often a necessary step, because many models make underlying assumptions that the training data is normally distributed, and if it isn't, we could risk risk biasing your model. Data can be standardized in many different ways, but in this course, we're going to talk about two methods: log normalization and scaling. It's also important to note that standardization is a preprocessing method applied to continuous, numerical data. We'll cover methods for dealing with categorical data later in the course.

3. When to standardize: linear distances

There are a few different scenarios in which we'd want to standardize your data. First, if we're working with any kind of model that uses a linear distance metric or operates in a linear space like k-nearest neighbors, linear regression, or k-means clustering, the model is assuming that the data and features we're giving it are related in a linear fashion, or can be measured with a linear distance metric, which may not always be the case.

4. When to standardize: high variance

Standardization should also be used when dataset features have a high variance, which is also related to distance metrics. This could bias a model that assumes the data is normally distributed. If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.

5. When to standardize: different scales

Modeling a dataset that contains continuous features that are on different scales is another standardization scenario. For example, consider predicting house prices using two features: the number of bedrooms and the last sale price. These two features are on vastly different scales, which will confuse most models. To compare these features, we must standardize them to put them in the same linear space. All of these scenarios assume we're working with a model that makes some kind of linearity assumptions; however, there are a number of models that are perfectly fine operating in a nonlinear space, or do a certain amount of standardization upon input, but they're outside the scope of this course.

6. Let's practice!

Now that you've learned when to standardize your data, let's test your knowledge.