Remove near zero variance predictors
As you saw in the video, for the next set of exercises, you'll be using the blood-brain dataset. This is a biochemical dataset in which the task is to predict the following value for a set of biochemical compounds:
log((concentration of compound in brain) / (concentration of compound in blood))
This gives a quantitative metric of the compound's ability to cross the blood-brain barrier, and is useful for understanding the biological properties of that barrier.
One interesting aspect of this dataset is that it contains many variables and many of these variables have extremely low variances. This means that there is very little information in these variables because they mostly consist of a single value (e.g. zero).
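Fortunately, caret contains a utility function called nearZeroVar() for identifying such variables, so that you can remove them and save time during modeling.
nearZeroVar() takes in data x, then computes two metrics for each column: the ratio of the frequency of the most common value to the frequency of the second most common value, which is compared against freqCut, and the percentage of distinct values out of the total number of samples, which is compared against uniqueCut. A column is flagged when its frequency ratio exceeds freqCut and its percentage of distinct values does not exceed uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. I like to be a little more aggressive and use freqCut = 2 and uniqueCut = 20 when calling nearZeroVar().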
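To make the two metrics concrete, here is a minimal base-R sketch that computes them by hand on a made-up data frame. The names toy, freq_ratio, percent_unique, and flag are hypothetical helpers for illustration, not part of caret, and the flagging rule is an approximation of what nearZeroVar() does rather than a copy of its source.

```r
# Toy data (made up for illustration): 20 rows, three predictors.
toy <- data.frame(
  mostly_zero = c(rep(0, 18), 1, 1),       # one value dominates (18 vs. 2)
  two_level   = rep(c(0, 1), times = 10),  # balanced, but only 2 distinct values
  continuous  = seq(0.5, 10, by = 0.5)     # 20 distinct values
)

# Count of the most common value divided by the count of the second most common.
freq_ratio <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  # Simplification (assumption): treat a constant column as infinitely unbalanced.
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Number of distinct values as a percentage of the number of rows.
percent_unique <- function(v) 100 * length(unique(v)) / length(v)

metrics <- data.frame(
  freq_ratio     = vapply(toy, freq_ratio, numeric(1)),
  percent_unique = vapply(toy, percent_unique, numeric(1)),
  row.names      = names(toy)
)

# Flag columns whose ratio exceeds freq_cut AND whose percentage of
# distinct values is at or below unique_cut.
flag <- function(m, freq_cut, unique_cut) {
  rownames(m)[m$freq_ratio > freq_cut & m$percent_unique <= unique_cut]
}

print(metrics)
print(flag(metrics, freq_cut = 19, unique_cut = 10))  # default-style cutoffs
print(flag(metrics, freq_cut = 2,  unique_cut = 20))  # more aggressive cutoffs
```

On this toy data, mostly_zero has a frequency ratio of 9, so it passes the default-style cutoffs but is flagged by the more aggressive ones. That is the sense in which lowering freqCut and raising uniqueCut removes more columns.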
This is a part of the course “Machine Learning with caret in R”
Exercise instructions
bloodbrain_x and bloodbrain_y are loaded in your workspace.
- Identify the near zero variance predictors by running nearZeroVar() on the blood-brain dataset. Store the result as an object called remove_cols. Use freqCut = 2 and uniqueCut = 20 in the call to nearZeroVar().
- Use names() to create a vector containing all column names of bloodbrain_x. Call this all_cols.
- Make a new data frame called bloodbrain_x_small with the near-zero variance variables removed. Use setdiff() to isolate the column names that you wish to keep (i.e. that you don't want to remove).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Identify near zero variance predictors: remove_cols
remove_cols <- nearZeroVar(___, names = TRUE,
                           freqCut = ___, uniqueCut = ___)
# Get all column names from bloodbrain_x: all_cols
# Remove from data: bloodbrain_x_small
bloodbrain_x_small <- bloodbrain_x[ , setdiff(___, ___)]
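For reference, one possible way to fill in the blanks is sketched below. It assumes that the exercise's bloodbrain_x and bloodbrain_y correspond to bbbDescr and logBBB from the BloodBrain data bundled with caret. That mapping is an assumption made here so the example can run outside the course workspace.

```r
library(caret)

# Assumption: bloodbrain_x / bloodbrain_y match caret's bundled BloodBrain data,
# which provides the predictors as bbbDescr and the outcome as logBBB.
data(BloodBrain)
bloodbrain_x <- bbbDescr
bloodbrain_y <- logBBB

# Identify near zero variance predictors by name, using the aggressive cutoffs.
remove_cols <- nearZeroVar(bloodbrain_x, names = TRUE,
                           freqCut = 2, uniqueCut = 20)

# Get all column names from bloodbrain_x.
all_cols <- names(bloodbrain_x)

# Keep only the columns that were not flagged; drop = FALSE keeps a data frame
# even if a single column were to remain.
keep_cols <- setdiff(all_cols, remove_cols)
bloodbrain_x_small <- bloodbrain_x[, keep_cols, drop = FALSE]

# Compare the number of columns before and after.
print(c(before = ncol(bloodbrain_x), after = ncol(bloodbrain_x_small)))
```

Passing names = TRUE makes nearZeroVar() return column names rather than column positions, which is what lets setdiff() operate on all_cols directly.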
Machine Learning with caret in R teaches the big ideas in machine learning, such as how to build and evaluate predictive models.
In this chapter, you will practice using train() to preprocess data before fitting models, improving your ability to make accurate predictions.