vtreat on a small example

In this exercise, you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

To design a treatment plan, use the function designTreatmentsZ():

treatplan <- designTreatmentsZ(data, varlist)

data: the original training data frame
varlist: a vector of input variables to be treated (as strings).
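As a minimal sketch of this step, the snippet below designs a plan on a small made-up data frame. The data frame toy and its values are invented for illustration (only the column names color, size and popularity come from this exercise), and verbose = FALSE is used just to silence progress messages.

```r
library(vtreat)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  color = c("b", "r", "r", "g", "r", "b", "g", "b", "r", "g"),
  size = c(10, 12, 9, 14, 11, 8, 13, 7, 15, 12),
  popularity = c(1.0, 1.3, 0.9, 1.1, 1.2, 0.8, 1.0, 0.7, 1.4, 0.9),
  stringsAsFactors = FALSE
)

# Input variables only; the outcome column is deliberately left out
vars <- c("color", "size")

# designTreatmentsZ() builds the plan without looking at any outcome
treatplan <- designTreatmentsZ(toy, vars, verbose = FALSE)

# The plan is a list; its scoreFrame element describes the new variables
names(treatplan$scoreFrame)
```

The exact names vtreat gives the new variables can differ between package versions, so it is safer to read them from scoreFrame than to hard-code them.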

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)

varName: the name of the new treated variable
origName: the name of the original variable that the treated variable comes from
code: the type of the new variable.
- "clean": a numerical variable with no NAs or NaNs
- "lev": an indicator variable for a specific level of the original categorical variable.

(magrittr::use_series() is an alias for $ that you can use in pipes.)
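To make the alias concrete, here is a small sketch that extracts the same element with $ and with use_series(). The list plan_like is a hypothetical stand-in for a treatment plan, not real vtreat output.

```r
library(magrittr)

# A plain list standing in for a treatment plan (hypothetical contents)
plan_like <- list(
  scoreFrame = data.frame(varName = c("a", "b"), stringsAsFactors = FALSE)
)

# Ordinary extraction with $
via_dollar <- plan_like$scoreFrame

# The same extraction written as a pipe step
via_pipe <- plan_like %>% use_series(scoreFrame)

# Both routes return the same object
identical(via_dollar, via_pipe)
```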

For these exercises, we want varName where code is either "clean" or "lev":

newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)
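The filtering step can be tried in isolation on a hand-written data frame shaped like a scoreFrame. The rows below are illustrative assumptions, not actual vtreat output; the "catP" row stands for one of the other variable types vtreat can produce, which this filter drops.

```r
library(dplyr)
library(magrittr)

# Hand-written stand-in for a scoreFrame (illustrative names only)
scoreFrame <- data.frame(
  varName = c("color_lev_x_b", "color_lev_x_r", "color_catP", "size"),
  origName = c("color", "color", "color", "size"),
  code = c("lev", "lev", "catP", "clean"),
  stringsAsFactors = FALSE
)

# Keep only the "clean" and "lev" rows, then pull out the variable names
newvarlist <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)

newvarlist
```

Here newvarlist keeps the two indicator names and the numeric variable name, and leaves out the "catP" entry.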

To transform the dataset into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestriction = newvarlist)

treatplan: the treatment plan
data: the data frame to be treated
varRestriction: the variables desired in the treated data
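Putting the steps together, the following sketch goes from raw data to a treated data frame. The data frame toy and its values are again invented for illustration, and the restriction argument is written as varRestriction, which is the spelling in vtreat's prepare() signature as far as this sketch assumes.

```r
library(vtreat)
library(dplyr)
library(magrittr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  color = c("b", "r", "r", "g", "r", "b", "g", "b", "r", "g"),
  size = c(10, 12, 9, 14, 11, 8, 13, 7, 15, 12),
  popularity = c(1.0, 1.3, 0.9, 1.1, 1.2, 0.8, 1.0, 0.7, 1.4, 0.9),
  stringsAsFactors = FALSE
)

# Design the plan on the input variables only
vars <- c("color", "size")
treatplan <- designTreatmentsZ(toy, vars, verbose = FALSE)

# Names of the "clean" and "lev" variables from the plan's scoreFrame
newvarlist <- treatplan %>%
  use_series(scoreFrame) %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)

# Apply the plan, keeping only the selected treated variables
toy.treat <- prepare(treatplan, toy, varRestriction = newvarlist)

# Check that every treated column is numeric
sapply(toy.treat, is.numeric)
```

Because popularity was never listed in vars, it does not appear in the treated frame; it would have to be added back before fitting a model.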

The dframe data frame and the magrittr package have been pre-loaded.

Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
Create a vector called vars with the names of the input variables (as strings).
Load the package vtreat.
Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.
- You only need the columns varName, origName and code.
- What are the names of the new indicator variables? Of the continuous variable?
Create a vector newvars that contains the values of varName for which code is either "clean" or "lev". Print it.
Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).
- Print it and compare to dframe.
