Remove near zero variance predictors
As you saw in the video, for the next set of exercises, you'll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:
log((concentration of compound in brain) / (concentration of compound in blood))
This gives a quantitative metric of the compound's ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.
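As a quick illustration of the target value, the sketch below computes the log ratio for three made-up compounds. The concentrations are hypothetical, and the text does not state the base of the logarithm, so R's default natural log is used here as an assumption; the sign of the result reads the same way in any base.

```r
# Hypothetical concentrations for three made-up compounds (arbitrary units)
conc_brain <- c(0.5, 2.0, 1.0)
conc_blood <- c(1.0, 1.0, 1.0)

# Log of the brain-to-blood concentration ratio.
# Assumption: natural log, since the text does not give the base.
log_ratio <- log(conc_brain / conc_blood)

# Negative: less compound in brain than in blood
# Positive: more compound in brain than in blood
# Zero:     equal concentrations
log_ratio
```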
One interesting aspect of this dataset is that it contains many variables and many of these variables have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).
Fortunately, caret contains a utility function called nearZeroVar() for identifying such variables, so that you can remove them and save time during modeling.
nearZeroVar() takes in data x, then looks at two quantities for each variable: the ratio of the frequency of the most common value to the frequency of the second most common value, which is compared against the cutoff freqCut, and the percentage of distinct values out of the number of total samples, which is compared against the cutoff uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar().
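To make the two cutoffs concrete, here is a minimal base-R sketch that computes both quantities by hand for two toy predictors. It is not caret's own code: the helper names nzv_metrics() and is_nzv() are hypothetical, and the flagging rule (frequency ratio above freqCut and percentage of distinct values at or below uniqueCut) is my reading of how the two cutoffs combine.

```r
# Hand-rolled versions of the two quantities described above (not caret's code)
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common value's count.
  # Assumption: a column with a single distinct value is given an infinite ratio here.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values out of the total number of samples
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Assumed rule: flag when the ratio exceeds freqCut AND
# the percentage of distinct values is at or below uniqueCut
is_nzv <- function(m, freqCut, uniqueCut) {
  m[["freq_ratio"]] > freqCut && m[["pct_unique"]] <= uniqueCut
}

# Toy predictors, made up for illustration
mostly_zero <- c(rep(0, 95), rep(1, 5))  # 95 zeros and 5 ones
spread_out  <- 1:100                     # every value is distinct

m1 <- nzv_metrics(mostly_zero)  # freq_ratio = 19, pct_unique = 2
m2 <- nzv_metrics(spread_out)   # freq_ratio = 1,  pct_unique = 100

is_nzv(m1, freqCut = 19, uniqueCut = 10)  # FALSE: 19 is not greater than 19
is_nzv(m1, freqCut = 2,  uniqueCut = 20)  # TRUE under the more aggressive cutoffs
is_nzv(m2, freqCut = 2,  uniqueCut = 20)  # FALSE: values are well spread out
```

The mostly-zero column sits exactly at the default ratio cutoff, so only the more aggressive settings flag it under this rule.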
This is a part of the course “Machine Learning with caret in R”
Exercise instructions
bloodbrain_x and bloodbrain_y are loaded in your workspace.
- Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar().
- Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols.
- Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don't want to remove).
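The names() and setdiff() steps can be sketched on a toy data frame in base R. Everything here is made up for illustration: toy_x stands in for bloodbrain_x, and remove_cols is written by hand as if it were the character vector returned by nearZeroVar() with names = TRUE.

```r
# Toy data frame standing in for bloodbrain_x (column names are hypothetical)
toy_x <- data.frame(
  a = 1:4,
  b = c(0, 0, 0, 1),
  c = c(2.5, 3.1, 4.8, 0.2)
)

# Assumption: pretend the near zero variance check flagged column "b"
remove_cols <- c("b")

# All column names of the data frame
all_cols <- names(toy_x)

# setdiff(x, y) returns the elements of x that are not in y,
# i.e. the columns to keep
keep_cols <- setdiff(all_cols, remove_cols)

# Subset by name; drop = FALSE keeps a data frame even if one column remains
toy_x_small <- toy_x[, keep_cols, drop = FALSE]
names(toy_x_small)
```

Selecting columns by name rather than by position keeps the subsetting correct even if the column order of the data frame changes.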
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(___, names = TRUE,
                           freqCut = ___, uniqueCut = ___)
# Get all column names from bloodbrain_x: all_cols
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(___, ___)]