Exercise

# Apply median imputation

In this chapter, you'll be using a version of the Wisconsin Breast Cancer dataset. This dataset presents a classic binary classification problem: 50% of the samples are benign, 50% are malignant, and the challenge is to identify which are which.

This dataset is interesting because many of the predictors contain missing values and most rows of the dataset have at least one missing value. This presents a modeling challenge, because most machine learning algorithms cannot handle missing values out of the box. For example, your first instinct might be to fit a logistic regression model to this data, but prior to doing this you need a strategy for handling the `NA`s.
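Before choosing a strategy, it can help to see how much of the data is actually missing. The sketch below uses only base R on a small made-up data frame; `toy_x` and its column names are hypothetical stand-ins and are not part of the exercise workspace.

```r
# Toy predictor table with made-up values (hypothetical, for illustration only)
toy_x <- data.frame(
  thickness = c(5, NA, 3, 8, NA),
  size      = c(1, 4, NA, 7, 2),
  shape     = c(2, 3, 1, 6, 4)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_x))
print(na_per_column)

# Share of rows that have at least one missing value
share_incomplete <- mean(!complete.cases(toy_x))
print(share_incomplete)
```

Running the same two checks on your own predictors shows how many rows a model would lose if it simply dropped incomplete cases.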

Fortunately, the `train()` function in `caret` contains an argument called `preProcess`, which allows you to specify that median imputation should be used to fill in the missing values. In previous chapters, you created models with the `train()` function using formulas such as `y ~ .`. An alternative way is to specify the `x` and `y` arguments to `train()`, where `x` is an object with samples in rows and features in columns and `y` is a numeric or factor vector containing the outcomes. Said differently, `x` is a matrix or data frame that contains the whole dataset you'd use for the `data` argument to the `lm()` call, for example, but excludes the response variable column; `y` is a vector that contains just the response variable column.
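To make the `x`/`y` interface concrete, here is a minimal sketch on simulated data, assuming the `caret` package is installed. The objects `toy_x`, `toy_y`, and `toy_model` are hypothetical and made up for illustration; they are not the exercise data, and the default resampling settings are used only to keep the call short.

```r
library(caret)

set.seed(42)
n <- 200

# Predictors: samples in rows, features in columns (simulated values)
toy_x <- data.frame(a = rnorm(n), b = rnorm(n))

# Outcome: a two-level factor kept separate from the predictors
toy_y <- factor(ifelse(toy_x$a + rnorm(n) > 0, "malignant", "benign"))

# Knock out some predictor values to mimic missing data
toy_x$a[sample(n, 20)] <- NA

# Pass x and y directly instead of a formula, and ask train() to
# replace each missing value with the median of its column
toy_model <- train(
  x = toy_x,
  y = toy_y,
  method = "glm",
  preProcess = "medianImpute"
)

print(toy_model)
```

Because the outcome never lives in the same object as the predictors, there is no response column to exclude by hand, which is the main convenience of this interface over the formula one.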

For this exercise, the argument `x` to `train()` is loaded in your workspace as `breast_cancer_x` and `y` as `breast_cancer_y`.

Instructions

- Use the `train()` function to fit a `glm` model called `median_model` to the breast cancer dataset. Use `preProcess = "medianImpute"` to handle the missing values.
- Print `median_model` to the console.