vtreat on a small example
In this exercise, you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").
To design a treatment plan, use the function designTreatmentsZ():
treatplan <- designTreatmentsZ(data, varlist)
- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).
designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
scoreFrame <- treatplan %>%
  magrittr::use_series(scoreFrame) %>%
  select(varName, origName, code)
- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
  - "clean": a numerical variable with no NAs or NaNs
  - "lev": an indicator variable for a specific level of the original categorical variable.
(magrittr::use_series() is an alias for $ that you can use in pipes.)
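As a rough illustration of the steps so far, the sketch below designs a treatment plan on a made-up data frame and extracts its scoreFrame twice: once with $ and once with use_series() in a pipe. The data frame toy and its columns (fruit, weight) are hypothetical and are not the exercise's dframe; verbose = FALSE only silences progress messages. The exact names of the new variables can vary with the installed vtreat version, so no output is shown.

```r
library(vtreat)
library(magrittr)

# Hypothetical toy data, for illustration only (not the exercise's dframe)
toy <- data.frame(
  fruit  = c("apple", "pear", "apple", "plum", "pear", "apple"),
  weight = c(150, 180, NA, 60, 175, 160),
  stringsAsFactors = FALSE
)

# Design a treatment plan for the two input variables
plan <- designTreatmentsZ(toy, c("fruit", "weight"), verbose = FALSE)

# Two equivalent ways to extract the scoreFrame element
sf_dollar <- plan$scoreFrame
sf_pipe   <- plan %>% use_series(scoreFrame)

# Keep only the columns that map new variables to original ones
sf_pipe[, c("varName", "origName", "code")]
```

Because use_series() is just $ under another name, both extractions should give the same data frame.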
For these exercises, we want varName where code is either "clean" or "lev":
newvarlist <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  magrittr::use_series(varName)
To transform the dataset into all numerical and one-hot-encoded variables, use prepare():
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data
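Putting the pieces together, here is a minimal end-to-end sketch on made-up data. The data frame toy, its columns, and the names inputs and toy_treat are hypothetical; base R subsetting stands in for the filter() pipeline so that the example needs only vtreat.

```r
library(vtreat)

# Hypothetical toy data: one categorical input, one numeric input with an NA,
# and an outcome column (price) that is not passed to the treatment plan
toy <- data.frame(
  fruit  = c("apple", "pear", "apple", "plum", "pear", "apple"),
  weight = c(150, 180, NA, 60, 175, 160),
  price  = c(1.2, 1.5, 1.1, 0.8, 1.6, 1.3),
  stringsAsFactors = FALSE
)
inputs <- c("fruit", "weight")

# Design the treatment plan from the input variables only
plan <- designTreatmentsZ(toy, inputs, verbose = FALSE)

# Names of the new variables whose code is "clean" or "lev"
sf <- plan$scoreFrame
newvars <- sf$varName[sf$code %in% c("clean", "lev")]

# Apply the plan, keeping only the selected variables
toy_treat <- prepare(plan, toy, varRestriction = newvars)
toy_treat
```

In the treated frame, every column should be numeric and the missing weight should be filled in. The outcome price is absent because it was never part of the plan.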
The dframe data frame and the magrittr package have been pre-loaded.
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
- Create a vector called vars with the names of the input variables (as strings).
- Load the package vtreat.
- Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
- Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.
  - You only need the columns varName, origName and code.
  - What are the names of the new indicator variables? Of the continuous variable?
- Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.
- Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).
  - Print it and compare to dframe.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print dframe
dframe
# Create and print a vector of variable names
(vars <- ___)
# Load the package vtreat
___
# Create the treatment plan
treatplan <- ___(___, ___)
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
  use_series(scoreFrame) %>%
  select(___, ___, ___))
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
  filter(code %in% ___) %>%
  use_series(varName))
# Create the treated training data
(dframe.treat <- ___(___, ___, varRestriction = newvars))