
Data distributions and transformations

1. Data distributions and transformations

Welcome back! In the previous exercises you practiced finding and handling missing values. In this video, we're going to discuss what it means to have different distributions between the data used to train a model and the test data (or future data), and how and when to transform your data.

2. Different distributions

There is a chance that the way we split a dataset before sending it into machine learning models can cause the training and test sets to have different distributions, which introduces bias into the machine learning framework. In the illustration shown, it's obvious that the distributions have both different means and different variances, and this will likely contribute to poor model performance. You'll get to practice this concept in the exercises that follow, and in chapter 2 we'll introduce the best technique to avoid this problem.
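As a minimal sketch of how you might spot such a mismatch, the snippet below compares the means and variances of the two sets. The data here is made up purely for illustration (a feature with an upward trend, split without shuffling so the two halves diverge); none of these names come from the video.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical trending feature: later rows have much larger values.
df = pd.DataFrame({'feature': np.arange(200) + rng.normal(size=200)})

# Splitting without shuffling sends early rows to train, late rows to test.
train, test = train_test_split(df, test_size=0.3, shuffle=False)

# When the distributions diverge, the means and variances differ noticeably.
print(train['feature'].agg(['mean', 'var']))
print(test['feature'].agg(['mean', 'var']))
```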

3. Train/test split

To prepare two distributions for visualization in this lesson, we'll use the 2-way multiple assignment seen here, train comma test, since we aren't building models yet. Once we get into building machine learning models in the next chapter, we'll use the 4-way split, where we create a feature matrix capital X and a target array y, split into train and test and named X_train, X_test, y_train, and so on. Notice, however, that the keyword arguments are exactly the same, with the dataset as the first argument. Setting test_size=0.3 in this example means that the train/test split of the data will be 70% training and 30% test. To create the plots, you'll use the pairplot function from the seaborn module, conventionally imported and aliased as sns, which returns a plot matrix of distributions and scatterplots. That takes care of what we need to cover for distributions; let's move on to data transformations.
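Here is a sketch of the two split styles just described, run on a made-up two-column DataFrame (the data and column names are placeholders, not objects from the video):

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})

# 2-way split: enough for visualizing distributions, since no model is built.
train, test = train_test_split(df, test_size=0.3)  # 70% train, 30% test

# 4-way split: used once we build models, with feature matrix X and target y.
X, y = df[['a', 'b']], rng.integers(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Plot matrix: distributions on the diagonal, scatterplots off the diagonal.
sns.pairplot(train)
```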

4. Data transformation

The left-hand side of this image depicts data that is right skewed. Outliers are one cause of non-normality, or non-Gaussian behavior, and transformation can reduce the influence of existing outliers. The right side shows the same data after it has been transformed with the log function. We'll talk about what to do if outliers still remain after transformation in the next lesson. For now, our goal is to transform data from a non-normal distribution to as approximately normal a distribution as we can, because many models perform better on normally distributed data than they would otherwise.
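A minimal sketch of that log transform, using a made-up right-skewed sample rather than the data on the slide:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed sample

transformed = np.log(skewed)  # the log pulls in the long right tail

print(skew(skewed))       # large positive skew before the transform
print(skew(transformed))  # close to 0 after: approximately normal
```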

5. Box-Cox transformations

In the exercises, you'll use the boxcox function from the scipy.stats module to perform power transformations. A power transformation raises each value in the dataset to a power lambda. The table you see here lists common values that can be passed to the transformation parameter lambda. Note that scipy spells it lmbda, that is, l-m-b-d-a. A value of 0.0 is used for taking the log of x, and 0.5 for the square root of x.
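A minimal sketch of those calls, on a small made-up array of strictly positive values (Box-Cox requires positive data):

```python
import numpy as np
from scipy.stats import boxcox

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# lmbda=0.0 is exactly log(x).
log_x = boxcox(x, lmbda=0.0)

# lmbda=0.5 is the square-root family: scipy returns (x**0.5 - 1) / 0.5,
# a rescaled square root of x.
sqrt_x = boxcox(x, lmbda=0.5)

# With no lmbda, scipy estimates the best value by maximum likelihood and
# returns both the transformed data and the fitted lambda.
transformed, fitted_lambda = boxcox(x)
```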

6. Let's practice!

Now, it's finally time for you to go check out some distributions and find some outliers!