Novel levels
When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.
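
As a rough illustration of that failure mode, the base-R sketch below encodes two small made-up data frames separately and compares the resulting columns. The names toy_train and toy_test and their contents are hypothetical; they are not the exercise's dframe and testframe.

```r
# Hypothetical toy data (an assumption for illustration only)
toy_train <- data.frame(color = c("b", "r", "r", "g", "b"),
                        stringsAsFactors = FALSE)
toy_test  <- data.frame(color = c("g", "y", "b", "y"),
                        stringsAsFactors = FALSE)

# One-hot-encode each frame on its own; "- 1" drops the intercept so
# every level present in the data gets an indicator column
train_cols <- colnames(model.matrix(~ color - 1, data = toy_train))
test_cols  <- colnames(model.matrix(~ color - 1, data = toy_test))

# Columns that exist in only one of the two encodings
only_in_test  <- setdiff(test_cols, train_cols)   # "colory"
only_in_train <- setdiff(train_cols, test_cols)   # "colorr"

print(only_in_test)
print(only_in_train)
```

The two encodings disagree on their columns, so a model fit on the training encoding has no coefficient for the novel level and expects a column the test encoding never produced.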
vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).
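
The following is a minimal sketch of that behavior, not the exercise solution. It assumes the vtreat package is installed (the code is skipped otherwise) and uses made-up data; toy_train, toy_test, plan, and vars are hypothetical names, and the treatment plan is rebuilt here with designTreatmentsZ() rather than taken from the previous exercise.

```r
# Hypothetical toy data (an assumption; not the exercise's data frames)
toy_train <- data.frame(
  color = c("b", "r", "r", "r", "r", "b", "r", "g", "b", "b"),
  size  = c(13, 11, 15, 14, 13, 11, 9, 12, 7, 12),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  color = c("g", "g", "y", "g", "g", "y", "b", "g", "g", "r"),
  size  = c(7, 8, 10, 12, 6, 8, 12, 12, 12, 8),
  stringsAsFactors = FALSE
)

yellow_sums <- NULL
if (requireNamespace("vtreat", quietly = TRUE)) {
  # Build a treatment plan from the training data only (no outcome needed)
  plan <- vtreat::designTreatmentsZ(toy_train, c("color", "size"),
                                    verbose = FALSE)

  # Keep the cleaned numeric ("clean") and indicator ("lev") variables
  sf   <- plan$scoreFrame
  vars <- sf$varName[sf$code %in% c("clean", "lev")]

  # Apply the plan to data that contains the unseen level "y"
  toy_treat <- vtreat::prepare(plan, toy_test, varRestriction = vars)

  # Sum the indicator columns for the rows whose color was "y"
  lev_cols    <- grep("_lev_", colnames(toy_treat), value = TRUE)
  yellow_sums <- rowSums(toy_treat[toy_test$color == "y", lev_cols,
                                   drop = FALSE])
  print(toy_treat)
  print(yellow_sums)
}
```

Because the plan is built from the training data alone, the treated test frame has the same columns the model was trained on, and inspecting yellow_sums shows how the rows with the unseen level are represented in the indicator columns.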
In this exercise, you will see how vtreat handles categorical values that did not appear in the training set.
The treatment plan treatplan and the set of variables newvars from the previous exercise are still available. The data frame dframe and a new data frame testframe have been pre-loaded.
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Print dframe and testframe.
  - Are there colors in testframe that didn't appear in dframe?
- Call prepare() to create a one-hot-encoded version of testframe (without the outcome). Call it testframe.treat and print it.
  - Use the varRestriction argument to restrict to only the variables in newvars.
  - How are the yellow rows encoded?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# treatplan is available
summary(treatplan)
# newvars is available
newvars
# Print dframe and testframe
___
___
# Use prepare() to one-hot-encode testframe
(testframe.treat <- ___(___, ___, varRestriction = ___))