Median imputation

1. Median imputation

Real-world data have missing values.

2. Dealing with missing values

This is a problem for most statistical and machine learning algorithms: they usually require numbers to work with and don't know what to do with missing data. One common approach is to throw out rows with missing data, but this is generally not a good idea. It can bias your dataset and produce over-confident models, and in extreme cases it can leave you with no data at all. A better default is often to use the median of the observed values in a column to guess what a missing value would have been. This works well if your data are "missing at random", that is, when there is no systematic pattern to which values are missing, and it lets you model data that include rows with missing values.
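As a minimal sketch of the idea in base R, the missing entries of a single numeric column can be filled with the median of the observed entries. The `hp` vector here is made-up illustration data, not taken from any real dataset.

```r
# Hypothetical horsepower readings with two missing values
hp <- c(110, NA, 93, 175, NA, 105)

# Median of the observed values only (93, 105, 110, 175)
hp_median <- median(hp, na.rm = TRUE)

# Replace each NA with that median; observed values are left unchanged
hp_imputed <- ifelse(is.na(hp), hp_median, hp)

print(hp_median)
print(hp_imputed)
```

Every row is kept, whereas dropping rows with `NA` would have discarded a third of this small vector.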

3. Example: mtcars

Let's generate some data with missing values. We'll start with the mtcars dataset, which contains measurements of the physical characteristics of some cars. In this case, we want to predict each car's MPG from its other attributes. Let's pretend some manufacturers don't report their car's horsepower, and randomly replace some points in this column with missing values. We can then split the dataset into a data frame of predictors (X) and the target we want to predict (Y). This demonstrates caret's non-formula interface for modeling. Unfortunately, because X contains missing values, fitting the model fails with a cryptic error. This is a point where many new users get stuck and have to go looking for help.
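The setup described above might look roughly like the following. The seed, the number of values blanked out (10), the choice of columns 2 to 4 as predictors, and the name `mtcars_na` are assumptions made for illustration.

```r
# Start from the built-in mtcars data and work on a copy
data(mtcars)
mtcars_na <- mtcars

# Pretend some manufacturers did not report horsepower:
# set 10 randomly chosen hp values to NA (seed and count are assumptions)
set.seed(42)
missing_rows <- sample(nrow(mtcars_na), 10)
mtcars_na[missing_rows, "hp"] <- NA

# Non-formula interface: target vector Y and predictor data frame X
Y <- mtcars_na$mpg
X <- mtcars_na[, 2:4]   # cyl, disp, hp (assumed predictor set)

# How many missing values each predictor now has
print(colSums(is.na(X)))

# With caret loaded, train(x = X, y = Y) stops with an error at this point,
# because hp contains missing values.
```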

4. A simple solution

The simple solution to this problem is to pass "medianImpute" to the preProcess argument of train, which tells caret to impute the missing values in X with their column medians. caret does this imputation inside each fold of the cross-validation, so you get an honest assessment of the entire modeling process: within each fold, the random forest model is fit after the imputation. The model now runs without error, and you don't have to do any additional work to clean your data.
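A sketch of that call is shown here, assuming the caret and randomForest packages are installed. The data setup repeats the assumed seed and column choices used for illustration; `method = "rf"` and the explicit 5-fold `trainControl` are also assumptions, chosen to match the random forest and cross-validation mentioned in the narration.

```r
library(caret)  # assumes caret (and randomForest, used by method = "rf") are installed

# Recreate the data with missing horsepower values (seed and count are assumptions)
data(mtcars)
mtcars_na <- mtcars
set.seed(42)
mtcars_na[sample(nrow(mtcars_na), 10), "hp"] <- NA

Y <- mtcars_na$mpg
X <- mtcars_na[, 2:4]   # cyl, disp, hp

# Median imputation is requested as a preprocessing step, so caret
# re-estimates the medians within each resample before fitting the model
model <- train(
  x = X, y = Y,
  method = "rf",
  preProcess = "medianImpute",
  trControl = trainControl(method = "cv", number = 5)
)

print(model)
```

Because the imputation is part of the `train` call rather than a separate cleaning step, the reported resampling error reflects imputation and model fitting together.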

5. Let’s practice!

Let's practice using median imputation.
