Data distributions

1. Data distributions

An important consideration before building a machine learning model is understanding what the distribution of your underlying data looks like. Many algorithms make assumptions about how your data is distributed or how different features interact with each other. For example, most models other than tree-based ones work best when your features are on the same scale. Feature engineering can be used to transform your data so that it fits those assumptions, or at least comes as close to them as possible.

2. Distribution assumptions

Many models other than tree-based ones assume, or at least perform better when, your data is approximately normally distributed. A normal distribution follows a bell shape like the one shown here. Its main characteristic is that about 68% of the data lies within 1 standard deviation of the mean, about 95% lies within 2 standard deviations of the mean, and about 99.7% falls within 3 standard deviations of the mean.
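As a quick check of those percentages, here is a minimal sketch that draws a synthetic normal sample with NumPy and measures how much of it falls within 1, 2 and 3 standard deviations of the mean. The sample size, the seed and the names `sample` and `within` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical sample: 100,000 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

mean = sample.mean()
std = sample.std()

# Fraction of points that lie within k standard deviations of the mean.
within = {
    k: float(np.mean(np.abs(sample - mean) <= k * std))
    for k in (1, 2, 3)
}

for k, fraction in within.items():
    print(f"within {k} standard deviation(s): {fraction:.3f}")
```

The printed fractions should land close to 0.68, 0.95 and 0.997, with small differences because the sample is random.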

3. Observing your data

To understand the shape of your own data, you can create histograms of each of the continuous features. To do so, once you have the matplotlib library loaded along with your DataFrame, call hist() on your DataFrame and then call plt.show() to display the graph. Here we see the first column has a fairly normal-looking distribution, but the second looks quite different, with the majority of the data concentrated at the lower values. This is also referred to as having a long right tail.
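The steps above might look roughly like the following sketch. The DataFrame here is synthetic, and the column names `col_normal` and `col_skewed` are made up for illustration; they are not the dataset shown in the lesson. The skew() call at the end is an addition that puts a number on the long right tail.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: one roughly normal column and one right-skewed column.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "col_normal": rng.normal(loc=50, scale=10, size=1_000),
    "col_skewed": rng.exponential(scale=10, size=1_000),
})

# One histogram per numeric column.
axes = df.hist(bins=30)
plt.show()

# Skewness near 0 suggests symmetry; a clearly positive value suggests a long right tail.
skewness = df.skew()
print(skewness)
```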

4. Delving deeper with box plots

While histograms are useful for showing the high-level distribution of the data, they do not show details such as where the middle chunk of your data sits in an easily readable fashion. For this you can use the box plot. The box plot shows the distribution of the data by calculating where the middle 50% of the data sits and marking it with the box. This range is known as the interquartile range, or IQR, and it sits between the 1st and 3rd quartiles. Each whisker extends to the furthest data point that lies within 1.5 times the IQR of the edge of the box, so a whisker never reaches beyond the actual range of the data. Any points outside the whiskers are marked as outliers. This makes the box plot useful for spotting points in your dataset that may be unwanted outliers.
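To make the box plot arithmetic concrete, here is a small sketch that computes the quartiles, the IQR, the whisker ends and the outliers by hand. The ten values are invented for illustration, and the quartiles use the default linear interpolation of pandas' quantile(); other plotting tools may compute quartiles slightly differently.

```python
import pandas as pd

# Hypothetical values with one obvious extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box spans the 1st to the 3rd quartile.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits that sit 1.5 times the IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside those limits.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Anything beyond the limits would be drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"IQR: {iqr}, whiskers: {lower_whisker} to {upper_whisker}")
print(f"outliers: {outliers.tolist()}")
```

With these values the box runs from 3.25 to 7.75, the whiskers reach 1 and 9, and 100 is flagged as the only outlier.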

5. Box plots in pandas

To create a box plot in pandas, select the list of columns you wish to plot from your DataFrame and call the boxplot() method on the result.
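A minimal sketch of that call, assuming a synthetic DataFrame with made-up column names (`col_normal`, `col_skewed`, `col_other`), could look like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data standing in for your own DataFrame.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "col_normal": rng.normal(loc=50, scale=10, size=500),
    "col_skewed": rng.exponential(scale=10, size=500),
    "col_other": rng.uniform(low=0, high=100, size=500),
})

# Select the columns to plot, then draw one box per column.
ax = df[["col_normal", "col_skewed"]].boxplot()
plt.show()
```

Plotting only columns with comparable ranges on one axis tends to keep the boxes readable.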

6. Pairing distributions

One final approach to looking at the distribution of data is to examine how different features in your DataFrame interact with each other. This type of chart is called a pairplot, and it can be useful for seeing whether columns are correlated with each other or have any association at all. To generate a pairplot, first import the seaborn package and then pass your DataFrame to the pairplot() function. In this example we can see that the first and last columns are somewhat related.

7. Further details on your distributions

While all these plots are very useful for understanding your data's shape, you will at times want to quickly get summary statistics of your data's distribution. You can get these with the describe() method, as seen here on the same dummy dataset we have been using to demonstrate the plots.
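As a rough illustration, describe() on a synthetic DataFrame (the column names below are made up and are not the lesson's dummy dataset) returns the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for your own DataFrame.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "col_normal": rng.normal(loc=50, scale=10, size=1_000),
    "col_skewed": rng.exponential(scale=10, size=1_000),
})

# One row per statistic, one column per numeric feature.
summary = df.describe()
print(summary)
```

Comparing the mean with the 50% row (the median) is one quick way to spot skew: in a right-skewed column the mean usually sits above the median.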

8. Let's practice!

Why is this important? Now that you can see the underlying structure of your data, in later lessons you will remove outliers and ensure all features are on comparable scales.

This exercise is part of the course Feature Engineering for Machine Learning in Python.
