
vtreat on a small example

In this exercise, you will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").
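To make the idea of indicator variables concrete, here is a minimal sketch that uses base R's model.matrix() rather than vtreat. The toy data frame and its values are made up for illustration, and vtreat names its columns differently.

```r
# Hypothetical toy data; this is not the exercise's dframe
toy <- data.frame(color = factor(c("b", "r", "g", "r")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
indicators <- model.matrix(~ color - 1, data = toy)
print(indicators)
```

Each row has exactly one 1, in the column for that row's level.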

To design a treatment plan, use the function designTreatmentsZ():

treatplan <- designTreatmentsZ(data, varlist)
  • data: the original training data frame
  • varlist: a vector of input variables to be treated (as strings).
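As a rough sketch of the call above, the following designs a treatment plan on a small made-up data frame. The names toy, inputs and plan are hypothetical and chosen only for this illustration; verbose = FALSE just silences the progress messages.

```r
library(vtreat)

# Hypothetical toy data; this is not the exercise's dframe
toy <- data.frame(
  color      = c("b", "r", "r", "g", "b", "r", "g", "b"),
  size       = c(13, 11, 15, 14, 13, 11, 9, 12),
  popularity = c(1.1, 1.4, 0.9, 1.0, 1.2, 1.3, 0.6, 1.1)
)

# Only the input variables are listed; the outcome is left out
inputs <- c("color", "size")

# Design the treatment plan without looking at the outcome
plan <- designTreatmentsZ(toy, inputs, verbose = FALSE)

# The plan is a list; scoreFrame describes the new variables
names(plan$scoreFrame)
```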

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)
  • varName: the name of the new treated variable
  • origName: the name of the original variable that the treated variable comes from
  • code: the type of the new variable.
    • "clean": a numerical variable with no NAs or NaNs
    • "lev": an indicator variable for a specific level of the original categorical variable.

(magrittr::use_series() is an alias for $ that you can use in pipes.)
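A small sketch of that equivalence, using a hand-built list that merely stands in for a treatment plan (plan_like and its contents are hypothetical):

```r
library(magrittr)

# Hypothetical stand-in for a treatment plan: a list with a scoreFrame element
plan_like <- list(
  scoreFrame = data.frame(
    varName = c("size", "color_lev_x_b"),
    code    = c("clean", "lev")
  )
)

# Extract the element with $ and with use_series() in a pipe
with_dollar <- plan_like$scoreFrame
with_pipe   <- plan_like %>% use_series(scoreFrame)

# Both routes return the same data frame
identical(with_dollar, with_pipe)
```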

For these exercises, we want varName where code is either "clean" or "lev":

newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)
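The filtering step can be tried on its own with a hand-built score frame. The rows below are made up for illustration, including the "catP" row, which stands for any code other than "clean" or "lev" that you want to drop; dplyr is assumed to be available for filter().

```r
library(dplyr)
library(magrittr)

# Hypothetical score frame; real variable names come from the treatment plan
scoreFrame <- data.frame(
  varName  = c("color_catP", "color_lev_x_b", "color_lev_x_r", "size"),
  origName = c("color", "color", "color", "size"),
  code     = c("catP", "lev", "lev", "clean"),
  stringsAsFactors = FALSE
)

# Keep only the "clean" and "lev" rows, then pull out the variable names
newvarlist <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)

newvarlist
```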

To transform the dataset into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
  • treatplan: the treatment plan
  • data: the data frame to be treated
  • varRestriction: the variables desired in the treated data
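Putting the steps together, here is a minimal end-to-end sketch on made-up data. The names toy, plan, keep and toy.treat are hypothetical, and one size value is deliberately missing so that the effect of the "clean" treatment is visible.

```r
library(vtreat)

# Hypothetical toy data with one missing numeric value
toy <- data.frame(
  color = c("b", "r", "r", "g", "b", "r", "g", "b"),
  size  = c(13, 11, NA, 14, 13, 11, 9, 12)
)

# Design the plan for the input variables
plan <- designTreatmentsZ(toy, c("color", "size"), verbose = FALSE)

# Keep only the "clean" and "lev" variables
sf   <- plan$scoreFrame
keep <- sf$varName[sf$code %in% c("clean", "lev")]

# Apply the plan, restricted to the variables we kept
toy.treat <- prepare(plan, toy, varRestriction = keep)

print(toy.treat)
```

The treated frame has one row per input row, and the numeric column no longer contains missing values.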

The dframe data frame and the magrittr package have been pre-loaded.

This exercise is part of the course

Supervised Learning in R: Regression


Exercise instructions

  • Print dframe. We will assume that color and size are input variables, and popularity is the outcome to be predicted.
  • Create a vector called vars with the names of the input variables (as strings).
  • Load the package vtreat.
  • Use designTreatmentsZ() to create a treatment plan for the variables in vars. Assign it to the variable treatplan.
  • Get and examine the scoreFrame from the treatment plan to see the mapping from old variables to new variables.
    • You only need the columns varName, origName and code.
    • What are the names of the new indicator variables? Of the continuous variable?
  • Create a vector newvars that contains the varName values for the rows where code is either "clean" or "lev". Print it.
  • Use prepare() to create a new data frame dframe.treat that is a one-hot-encoded version of dframe (without the outcome column).
    • Print it and compare to dframe.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Print dframe
dframe

# Create and print a vector of variable names
(vars <- ___)

# Load the package vtreat
___

# Create the treatment plan
treatplan <- ___(___, ___)

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(___, ___, ___))

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% ___) %>%
    use_series(varName))

# Create the treated training data
(dframe.treat <- ___(___, ___, varRestriction = newvars))