Missing data and outliers

1. Missing data and outliers

In this lesson, we'll take a step back and explore a few scenarios prior to the model building stage. We'll look at how to handle missing data and then deal with outliers.

2. Handling missing data

How do we identify and correct for null values in our dataset? There are two common approaches. First is dropping the whole row when you detect a missing value, and second is imputing the missing values with some other value.

3. Drop the whole row

Dropping the whole row is likely the simplest approach to correcting for null values, as it can be done in one line of code. However, there are some trade-offs to consider. By dropping every row with a null value, you could lose a significant portion of your dataset and exclude information that could strengthen your model or produce insights. In this example, the line of code would drop rows 4 and 6, since they each contain at least one null value.
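To make the trade-off concrete, here is a minimal sketch of the one-line drop. The DataFrame, its column names, and its values are made up for illustration and are not the example from the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "income": [50000, 64000, 58000, np.nan, 72000],
})

# The one-liner: drop every row that contains at least one null value.
cleaned = df.dropna()

# Check how much data the drop cost before committing to it.
rows_lost = len(df) - len(cleaned)
print(f"Dropped {rows_lost} of {len(df)} rows")
```

Comparing the row counts before and after is a quick way to see whether dropping is affordable for your dataset.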

4. Impute missing values

Another option is to impute values in place of the nulls. This approach takes a little more thought, but it allows you to preserve the information contained in rows that have some null values. There are a few popular ways to impute values. You can insert a constant value, like 0; insert a value randomly selected from another observation; use the mean, median, or mode; or use another model to predict the value and impute that.
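The simpler imputation options can be sketched in a few lines of pandas. This is only an illustration under assumptions: the data and the column names are invented, and the model-based option is left out because it depends on the model you choose.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0],
    "city": ["Oslo", None, "Lima", "Oslo", "Pune"],
})

# Option 1: a constant value, like 0.
age_constant = df["age"].fillna(0)

# Option 2: a summary statistic of the observed values.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# The mode suits categorical fields; mode() can return ties, so take the first.
city_mode = df["city"].fillna(df["city"].mode()[0])

# Option 3: a value drawn at random from the other observations.
rng = np.random.default_rng(0)  # seeded so the draw is repeatable
observed = df["age"].dropna().to_numpy()
age_random = df["age"].copy()
mask = age_random.isnull()
age_random.loc[mask] = rng.choice(observed, size=int(mask.sum()))
```

Each option fills the same gap differently, so it is worth checking how the choice shifts the distribution of the column.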

5. A few useful functions

Let's discuss a few useful functions for these techniques. isnull is a pandas function that flags null values, which lets you identify any rows that contain one. You can also take it a step further and specify which fields must be null. Similarly, you can use the pandas dropna function if you want to drop rows outright: either every row that contains a null, or only the rows with nulls in a specified subset of columns. fillna is useful for imputation; you supply the replacement value, such as a constant or a column mean, and pandas fills in the nulls.
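Here is a short sketch of how these functions might be combined to inspect and filter nulls. The DataFrame and column names are hypothetical; the calls themselves (isnull, dropna with subset or how) are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 41, np.nan],
    "income": [50000, 64000, np.nan, np.nan],
})

# isnull returns a Boolean mask; summing it counts the nulls per column.
null_counts = df.isnull().sum()

# Rows that have a null in any field.
rows_with_any_null = df[df.isnull().any(axis=1)]

# Rows where one specific field is null.
rows_missing_age = df[df["age"].isnull()]

# Drop rows only when the listed columns are null.
kept_income = df.dropna(subset=["income"])

# Drop rows only when every field is null.
kept_partial = df.dropna(how="all")
```

Counting nulls per column first is a cheap way to decide between dropping and imputing.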

6. Dealing with outliers

Moving on to outliers, there are a few different ways to use statistics to identify outliers, including standard deviation, or z-scores, as well as interquartile range, commonly referred to as IQR.

7. Standard deviations

Using standard deviations is a popular, straightforward method for identifying outliers and is the most likely to come up in your interview. Quite simply, any observation that falls more than 3 standard deviations from the mean is deemed an outlier. On the normal curve shown, each tail beyond that threshold makes up around 0.1 percent of the population; anything past this threshold is considered an outlier.
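As a rough sketch of the three standard deviation rule, you can compute a z-score for each observation and flag those whose absolute value exceeds 3. The data here is simulated, with one extreme value planted on purpose; none of it comes from the lesson.

```python
import numpy as np
import pandas as pd

# Simulated data (an assumption for illustration): roughly normal values
# with one extreme value planted at position 0.
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=50, scale=5, size=1000))
values.iloc[0] = 120.0

# z-score: how many standard deviations each value sits from the mean.
z_scores = (values - values.mean()) / values.std()

# Flag anything more than 3 standard deviations from the mean.
outliers = values[z_scores.abs() > 3]
print(outliers)
```

Note that the mean and standard deviation are themselves pulled by extreme values, which is one reason the IQR approach is also worth knowing.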

8. Interquartile range (IQR)

Using IQR is another way to determine whether or not a value is an outlier. Recall box plots from earlier in the course. You can summarize your data pretty effectively in one plot using the median, quartiles, and range. The IQR is computed by subtracting the first quartile from the third quartile. Using this value, you can set outlier thresholds with the formula you see at the tail ends of the box plot. You take 1.5 times the IQR, then subtract that from the first quartile to get the lower threshold and add it to the third quartile to get the upper threshold. When you generate a box plot, you'll see these outliers represented as dots outside of the end points in the picture.
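The IQR thresholds can be worked through on a small example. The values below are invented for illustration, and the quartiles use pandas' default linear interpolation, so other tools may give slightly different cutoffs.

```python
import pandas as pd

# Hypothetical values, made up for illustration.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 45])

# First and third quartiles, and the range between them.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Thresholds: 1.5 * IQR below the first quartile and above the third.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the thresholds is treated as an outlier.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

These are the same points a box plot would draw as dots beyond the whiskers.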

9. Summary

Let's summarize what we learned. We discussed ways to handle null values, including dropping a whole row or deciding on a value to impute in place of a null value. Then, we talked about identifying outliers using the three standard deviation threshold and interquartile range.

10. Let's prepare for the interview!

Let's get to work, and go try this out on some exercises!