Apply median imputation

In this chapter, you'll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.

This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.

Fortunately, the train() function in caret has an argument called preProcess, which lets you specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with train() using a formula such as y ~ . (the response modeled on all other columns). Alternatively, you can pass the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is a matrix or data frame holding the same data you would pass as the data argument to lm(), for example, minus the response variable column; y is a vector holding just the response variable column.
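To make the x/y split and the effect of median imputation concrete, here is a minimal base-R sketch. The data frame toy, its columns, and the helper impute_median() are hypothetical names made up for illustration. The helper only mimics the idea behind "medianImpute" (replace each NA with the median of its column); it is not caret's internal implementation.

```r
# Hypothetical toy data standing in for a real dataset with missing values
toy <- data.frame(
  outcome = factor(c("benign", "malignant", "benign", "malignant", "benign")),
  size    = c(1, NA, 3, 10, 2),
  shape   = c(2, 4, NA, 8, 1)
)

# Split into the x / y form: x drops the response column, y is only the response
toy_x <- toy[, setdiff(names(toy), "outcome")]
toy_y <- toy$outcome

# Illustrative helper (not a caret function): fill NAs with the column median
impute_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}

# Apply the helper to every feature column
toy_x_imputed <- as.data.frame(lapply(toy_x, impute_median))

print(toy_x_imputed)
# The NA in size becomes 2.5 (median of 1, 2, 3, 10)
# The NA in shape becomes 3 (median of 1, 2, 4, 8)
```

Because the median is computed per column from the observed values only, an outlier such as the 10 in size has little influence on the filled-in value.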

For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.

This exercise is part of the course

Machine Learning with caret in R

Exercise instructions

  • Use the train() function to fit a glm model called median_model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values.
  • Print median_model to the console.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Apply median imputation: median_model
median_model <- train(
  x = ___, 
  y = ___,
  method = ___,
  trControl = myControl,
  preProcess = ___
)

# Print median_model to console
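One way the completed call might look is sketched below. So that it runs on its own, it uses simulated stand-ins: sim_x, sim_y, and the 5-fold myControl defined here are assumptions for illustration, not the course's actual workspace objects. In the exercise itself you would pass breast_cancer_x and breast_cancer_y and reuse the myControl already loaded in your workspace.

```r
library(caret)

set.seed(42)
n <- 200

# Simulated stand-ins for breast_cancer_x / breast_cancer_y (assumption)
sim_x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
sim_y <- factor(ifelse(sim_x$a + sim_x$b + rnorm(n) > 0, "malignant", "benign"))

# Knock out some predictor values so there is something to impute
sim_x$a[sample(n, 20)] <- NA
sim_x$b[sample(n, 20)] <- NA

# Assumed resampling setup; the course's myControl may differ
myControl <- trainControl(method = "cv", number = 5)

# Apply median imputation: median_model
median_model <- train(
  x = sim_x,
  y = sim_y,
  method = "glm",
  trControl = myControl,
  preProcess = "medianImpute"
)

# Print median_model to console
print(median_model)
```

Passing preProcess to train() means the imputation is carried out as part of the model fitting, so the NAs never reach glm directly.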