vtreat on a small example
In this exercise, you will use vtreat to one-hot-encode a categorical variable on a small example.
vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").
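To make "indicator variable" concrete, here is a minimal base R sketch of one-hot encoding. It uses `model.matrix()` rather than vtreat, and the `toy` data frame is a hypothetical stand-in, so it only illustrates the shape of the result, not how vtreat builds or names its "lev" columns.

```r
# Hypothetical toy data (an assumption for illustration; not the course's dframe)
toy <- data.frame(color = c("b", "r", "r", "g"), stringsAsFactors = TRUE)

# One 0/1 indicator column per level of color.
# The "- 1" removes the intercept so that no level is dropped.
indicators <- model.matrix(~ color - 1, data = toy)

print(indicators)
```

Each row has a single 1, in the column for that row's level.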
To design a treatment plan, use the function designTreatmentsZ():
treatplan <- designTreatmentsZ(data, varlist)
- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).
designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
scoreFrame <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
select(varName, origName, code)
- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
  - "clean": a numerical variable with no NAs or NaNs
  - "lev": an indicator variable for a specific level of the original categorical variable.
(magrittr::use_series() is an alias for $ that you can use in pipes.)
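A small sketch of that equivalence, assuming the magrittr package is installed. The `plan` list is a hypothetical stand-in for a treatment plan, not a real vtreat object.

```r
library(magrittr)

# Hypothetical stand-in for a treatment plan: a list with a scoreFrame element
plan <- list(scoreFrame = data.frame(varName = c("x_clean", "y_lev_a")))

# Extract the element with $ and with use_series() inside a pipe
via_dollar <- plan$scoreFrame
via_pipe <- plan %>% use_series(scoreFrame)

# Both routes return the same object
print(identical(via_dollar, via_pipe))
```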
For these exercises, we want varName where code is either "clean" or "lev":
newvarlist <- scoreFrame %>%
filter(code %in% c("clean", "lev")) %>%
magrittr::use_series(varName)
To transform the dataset into all numerical and one-hot-encoded variables, use prepare():
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data
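The steps described so far can be strung together as in the sketch below. It assumes the vtreat, magrittr and dplyr packages are installed; the `toy` data frame and its values are made up for illustration and are not the course's dframe. The exact names vtreat gives the new columns can differ between package versions, so the sketch prints them rather than assuming them.

```r
library(vtreat)
library(magrittr)
library(dplyr)

# Hypothetical toy data (an assumption; not the course's dframe)
toy <- data.frame(
  color = c("b", "r", "r", "g", "b", "r", "g", "b"),
  size = c(13, 11, 15, 14, 13, 11, 9, 12),
  popularity = c(1.1, 1.4, 0.9, 1.2, 1.0, 0.8, 1.1, 0.9),
  stringsAsFactors = FALSE
)

# Input variables only; the outcome (popularity) is not treated
vars <- c("color", "size")

# Design the treatment plan without using the outcome
treatplan <- designTreatmentsZ(toy, vars, verbose = FALSE)

# Map each new variable back to its original variable and type
scoreFrame <- treatplan %>%
  use_series(scoreFrame) %>%
  select(varName, origName, code)
print(scoreFrame)

# Keep only the cleaned numeric and indicator variables
newvars <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)
print(newvars)

# Apply the plan, restricting the output to the selected variables
toy.treat <- prepare(treatplan, toy, varRestriction = newvars)
print(toy.treat)
```

Because the plan is designed from the input variables alone, the treated data frame contains only numeric columns derived from color and size.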
The dframe data frame and the magrittr package have been pre-loaded.
This exercise is part of the course
Supervised Learning in R: Regression
Exercise instructions
- Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
- Create a vector called vars with the names of the input variables (as strings).
- Load the package vtreat.
- Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
- Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.
  - You only need the columns varName, origName and code.
  - What are the names of the new indicator variables? Of the continuous variable?
- Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.
- Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).
  - Print it and compare to dframe.
Hands-on interactive exercise
Try this exercise and complete the sample code.
# Print dframe
dframe
# Create and print a vector of variable names
(vars <- ___)
# Load the package vtreat
___
# Create the treatment plan
treatplan <- ___(___, ___)
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
use_series(scoreFrame) %>%
select(___, ___, ___))
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
filter(code %in% ___) %>%
use_series(varName))
# Create the treated training data
(dframe.treat <- ___(___, ___, varRestriction = newvars))