
Handling missing data

1. Handling missing data

Welcome to Preparing for machine learning interview questions! My name is Lisa Stuart and I am a Data Scientist. In this course I will cover most of the topics that will help you succeed in a Machine Learning interview using Python.

2. Prerequisites

Because this is an interview prep course, you will find the concepts and exercises more challenging than standard DataCamp courses, so please ensure that you're comfortable with the prerequisite courses Supervised Learning with scikit-learn and Unsupervised Learning in Python. We're going to take what you've learned so far and step it up so that, by the end, you'll set yourself apart from other candidates in an ML interview.

3. Course outline

In this course, we'll start off chapter 1 by covering data pre-processing and visualization. The second and third chapters will be dedicated to supervised learning and unsupervised learning, and the fourth and final chapter will touch on model selection and evaluation.

4. Machine learning (ML) pipeline

From a high level, the machine learning pipeline using scikit-learn looks something like this. You import the modules you need, instantiate an estimator object, fit it to your training data, and then use it to predict.
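As a rough sketch of that import, instantiate, fit, and predict pattern, here is a tiny example. The toy data and the choice of KNeighborsClassifier are assumptions made only for illustration; any scikit-learn estimator follows the same steps.

```python
# 1. Import the module you need.
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up for illustration): one feature, two classes.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

# 2. Instantiate the estimator object.
model = KNeighborsClassifier(n_neighbors=1)

# 3. Fit it to the training data.
model.fit(X_train, y_train)

# 4. Predict labels for new observations.
predictions = model.predict([[0.5], [2.5]])
print(predictions)
```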

5. Our ML pipeline

But there is more to the story, so the pipeline we're going to use incorporates other important steps. Don't worry about the details: we'll start slow and build as we go, continually orienting ourselves to where we are in the process.

6. Missing data

In the remainder of this video lesson, we’re going to discuss how to find missing values, as well as the impact of different techniques for filling them as a pre-processing step in the machine learning workflow. Checking for missing data is an integral part of exploratory data analysis, which is where you should always begin.

7. Techniques

The two most commonly used strategies are omission, which removes rows and/or columns, and imputation, which fills in missing observations. You'll use two classes from scikit-learn's impute module. SimpleImputer fills with a constant such as zero, the mean, the median, or the mode, depending on the value you supply to the strategy keyword argument. IterativeImputer imputes by modeling each feature with missing values as a function of the other features.
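Here is a minimal sketch of both imputers on a small made-up array; the data and variable names are assumptions for illustration only. Note that IterativeImputer is still marked experimental in scikit-learn, so it has to be enabled with an extra import before it can be used.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Made-up data with one missing value in each column.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Fill each missing value with its column mean.
# Other strategies: "median", "most_frequent" (the mode), "constant".
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Fill with zeros by combining strategy="constant" with fill_value.
X_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

# Model each feature with missing values as a function of the others.
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean)
print(X_zero)
print(X_iter)
```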

8. Why bother?

So why bother trying different imputation techniques when data is missing? First, how you handle missing values can introduce bias, and handling them appropriately reduces that risk. Second, and perhaps most importantly, most machine learning algorithms require complete data and raise an error otherwise.
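To see the second point in practice, here is a small sketch that tries to fit a model on data containing a missing value. LinearRegression and the toy arrays are assumptions chosen for illustration; many other scikit-learn estimators behave the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the third observation has a missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn rejects NaN input for this estimator.
    fit_failed = True
    print(f"Fit failed: {err}")
```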

9. Effects of imputation

The exact effect an imputation technique has on the distribution of a given feature depends on several factors, including the missing values themselves, the original variance, the presence of outliers, and the size and direction of any skew. As general guidelines, removing rows and/or columns can discard too much data, filling with zeros tends to bias the results downward, and the median tends to be a better choice than the mean when there are outliers. Mode and iterative imputation have varying degrees of helpfulness, so it's best to just give them a try.
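A quick way to get a feel for these guidelines is to fill the same feature several ways and compare summary statistics. The sketch below uses a made-up Series with one outlier and one missing value; the numbers are for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up feature with an outlier (100.0) and one missing value.
s = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

# The outlier drags the mean of the observed values well above the median.
print("observed mean:", s.mean())
print("observed median:", s.median())

# Fill the same feature three different ways.
filled = pd.DataFrame({
    "zero": s.fillna(0),
    "mean": s.fillna(s.mean()),
    "median": s.fillna(s.median()),
})

# Compare how each fill changes the center and spread of the feature.
summary = filled.agg(["mean", "median", "std"])
print(summary)
```

In this toy case the zero fill pulls the mean below the mean of the observed values, while the median fill inserts a value that sits with the bulk of the data rather than one dragged toward the outlier.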

10. More functions:

Here are a few more functions you'll use in the exercises. Chaining isna with sum counts the missing values in each column. Selecting a column with brackets and calling an aggregate function such as mean returns that column's mean. The shape attribute gives the row and column dimensions, while fillna fills missing values with the argument passed to it. With include equals np.number, select_dtypes returns the numeric columns; with include equals object, it returns the string columns. Finally, fit_transform fits an imputer to the numeric columns passed to it and transforms them in one step.
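Here is a short sketch of those functions working together. The DataFrame, its column names, and the median strategy are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data; "city" is stored with object dtype explicitly.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [50.0, 60.0, np.nan],
    "city": pd.Series(["Oslo", None, "Lima"], dtype=object),
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Bracket subsetting plus an aggregate function.
age_mean = df["age"].mean()

# Row and column dimensions.
n_rows, n_cols = df.shape

# Fill missing values with the argument passed to fillna.
age_filled = df["age"].fillna(age_mean)

# Split numeric columns from string (object) columns.
numeric_cols = df.select_dtypes(include=np.number)
string_cols = df.select_dtypes(include="object")

# Fit the imputer and transform the numeric columns in one step.
numeric_imputed = SimpleImputer(strategy="median").fit_transform(numeric_cols)

print(missing_per_column)
print(numeric_imputed)
```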

11. Effects of missing values

Before we move on to the exercises, here is a multiple-choice question for you. What are the effects of missing values in a machine learning setting? If the answer is not immediately apparent, pause this video to read through the possible answers and give yourself a moment to think about it. If you still aren't sure, consider re-watching the video lesson up to this point before revealing the answer on the next slide.

12. Effect of missing values: answer

The answer is that missing data introduces bias, so the best approach is to test several ways of filling the missing values and choose the one that changes the variance of the dataset the least.

13. Effects of missing values: incorrect answers

These are the reasons why the other answers are incorrect; make sure you understand them.

14. Let's practice!

Time to handle missing data for yourself!