vtreat on a small example
In this exercise, you will use vtreat to one-hot-encode a categorical variable on a small example.
vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").
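To make the two kinds of treated variable concrete, here is a minimal base R sketch that builds them by hand, without vtreat. The data frame `toy`, the column-name pattern, and the choice to fill missing values with the mean are all assumptions made for illustration; vtreat's own naming and cleaning rules may differ in detail.

```r
# Hypothetical toy data (not the exercise's dframe)
toy <- data.frame(
  color = c("b", "r", "r", "g"),
  size  = c(13, 11, NA, 15),
  stringsAsFactors = FALSE
)

# "lev"-style variables: one 0/1 indicator column per level of color
levels_seen <- sort(unique(toy$color))
indicators <- sapply(levels_seen, function(lv) as.integer(toy$color == lv))
colnames(indicators) <- paste0("color_lev_x_", levels_seen)  # assumed naming pattern

# "clean"-style variable: a numeric column with no NAs
# (assumption of this sketch: fill with the mean of the observed values)
size_clean <- ifelse(is.na(toy$size), mean(toy$size, na.rm = TRUE), toy$size)

manual <- data.frame(indicators, size_clean)
print(manual)
```

Each row has exactly one indicator set to 1, and the numeric column no longer contains missing values.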
To design a treatment plan, use the function designTreatmentsZ():
treatplan <- designTreatmentsZ(data, varlist)
- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).
designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)
- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
  - "clean": a numerical variable with no NAs or NaNs
  - "lev": an indicator variable for a specific level of the original categorical variable.
(magrittr::use_series() is an alias for $ that you can use in pipes.)
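As a quick check of that equivalence, the sketch below extracts the same element with $ and with use_series() in a pipe. The list `plan_like` is a hypothetical stand-in for a treatment plan, not a real vtreat object.

```r
library(magrittr)

# Hypothetical list standing in for a treatment plan
plan_like <- list(scoreFrame = data.frame(varName = c("a", "b")))

# Ordinary extraction with $
via_dollar <- plan_like$scoreFrame

# The same extraction written as a pipe step
via_pipe <- plan_like %>% use_series(scoreFrame)

print(identical(via_dollar, via_pipe))
```

The pipe form is convenient when the extraction sits in the middle of a longer chain, as in the scoreFrame example above.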
For these exercises, we want varName where code is either "clean" or "lev":
newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)
To transform the dataset into all numerical and one-hot-encoded variables, use prepare():
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data
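The three steps above (design, select, prepare) can be sketched end to end on a small invented data frame. The data frame `toy` and its values are hypothetical and are not the exercise's dframe. The sketch assumes the vtreat, magrittr, and dplyr packages are installed, and passes verbose = FALSE only to keep the output quiet.

```r
library(vtreat)
library(magrittr)
library(dplyr)

# Hypothetical toy data; "outcome" is deliberately left out of the variable list
toy <- data.frame(
  color   = c("b", "r", "r", "g", "b", "g", "r", "b"),
  size    = c(13, 11, 15, 14, 9, 12, 7, 10),
  outcome = c(1.1, 1.4, 0.9, 1.2, 0.8, 1.0, 0.7, 0.9),
  stringsAsFactors = FALSE
)
vars <- c("color", "size")

# Step 1: design the treatment plan from the input variables only
treatplan <- designTreatmentsZ(toy, vars, verbose = FALSE)

# Step 2: inspect the mapping from original to treated variables
scoreFrame <- treatplan %>%
  use_series(scoreFrame) %>%
  select(varName, origName, code)
print(scoreFrame)

# Keep only the "clean" and "lev" variables
newvars <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)

# Step 3: apply the plan, restricted to the selected variables
toy.treat <- prepare(treatplan, toy, varRestriction = newvars)
print(toy.treat)
```

The treated frame has one row per input row, contains only the variables named in newvars, and does not carry the outcome column, because the plan was designed from the input variables alone.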
The dframe data frame and the magrittr package have been pre-loaded.
This exercise is part of the course
Supervised Learning in R: Regression
Exercise instructions
- Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
- Create a vector called vars with the names of the input variables (as strings).
- Load the package vtreat.
- Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
- Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.
  - You only need the columns varName, origName and code.
  - What are the names of the new indicator variables? Of the continuous variable?
- Create a vector newvars that contains the variable varName where code is either clean or lev. Print it.
- Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).
  - Print it and compare to dframe.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Print dframe
dframe
# Create and print a vector of variable names
(vars <- ___)
# Load the package vtreat
___
# Create the treatment plan
treatplan <- ___(___, ___)
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(___, ___, ___))
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% ___) %>%
    use_series(varName))
# Create the treated training data
(dframe.treat <- ___(___, ___, varRestriction = newvars))