Apply median imputation
In this chapter, you'll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.
This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the NAs.
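To make the idea of median imputation concrete, here is a minimal base R sketch of what it does to a column: each NA is replaced with the median of the observed values in that column. The toy data frame and the helper impute_median() are hypothetical and invented for illustration; they are not part of caret or of the course dataset.

```r
# Toy data frame with missing values (made-up numbers, not the breast cancer data)
toy <- data.frame(
  thickness = c(5, NA, 3, 8, 1),
  size      = c(1, 4, NA, NA, 2)
)

# Hypothetical helper: replace each NA with the median of the observed values
impute_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}

# Apply the helper column by column and rebuild the data frame
toy_imputed <- as.data.frame(lapply(toy, impute_median))

print(toy_imputed)
```

In this toy case the missing thickness value becomes 4 (the median of 5, 3, 8, 1) and both missing size values become 2 (the median of 1, 4, 2).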
Fortunately, the train() function in caret contains an argument called preProcess, which allows you to specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the train() function using a formula such as y ~ . to specify the model. An alternative way is to specify the x and y arguments to train(), where x is an object with samples in rows and features in columns and y is a numeric or factor vector containing the outcomes. Said differently, x is a matrix or data frame that contains the whole dataset you'd use for the data argument to the lm() call, for example, but excludes the response variable column; y is a vector that contains just the response variable column.
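As a rough illustration of the x and y interface, the sketch below builds the two objects from a single data frame and passes them to train() with median imputation. It uses mtcars with a few values blanked out as stand-in data, and the names cars, example_x, example_y, example_control, and example_model are assumptions made up for this example; the model is fitted only if caret is installed.

```r
# Stand-in data (an assumption for illustration): mtcars with a few NAs added
cars <- mtcars[, c("am", "mpg", "wt", "hp")]
cars$mpg[c(3, 11)] <- NA
cars$hp[c(7, 20)] <- NA

# x: predictors only, with the response column dropped
example_x <- cars[, setdiff(names(cars), "am")]

# y: just the response, as a factor for classification
example_y <- factor(cars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Fit the model only when caret is available
if (requireNamespace("caret", quietly = TRUE)) {
  library(caret)
  set.seed(42)  # make the cross-validation folds reproducible
  example_control <- trainControl(method = "cv", number = 5)
  example_model <- train(
    x = example_x,
    y = example_y,
    method = "glm",
    trControl = example_control,
    preProcess = "medianImpute"  # fill NAs with column medians before fitting
  )
  print(example_model)
}
```

On data this small, glm may emit convergence or separation warnings; the point here is only the shape of the call, not the quality of the fit.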
For this exercise, the argument x to train() is loaded in your workspace as breast_cancer_x and y as breast_cancer_y.
This exercise is part of the course
Machine Learning with caret in R
Exercise instructions
- Use the train() function to fit a glm model called median_model to the breast cancer dataset. Use preProcess = "medianImpute" to handle the missing values.
- Print median_model to the console.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Apply median imputation: median_model
median_model <- train(
  x = ___,
  y = ___,
  method = ___,
  trControl = myControl,
  preProcess = ___
)

# Print median_model to console