1. One-Hot-Encoding Categorical Variables
To prepare for the next lesson on the xgboost implementation of gradient boosting, in this lesson you will learn how to safely convert categorical variables to indicator variables.
2. Why Convert Categoricals Manually?
As we mentioned in a previous lesson, most R modeling functions do a great job of managing categorical variables for you, so usually you don't have to worry about it. However, not all modeling tools do this. In the next lesson, you will learn about xgboost, a package built on a C++ library that is also widely used from Python. xgboost does not directly accept categoricals; they must be converted to indicators or another numerical representation. In the machine learning community, and in Python in particular, this conversion is called one-hot encoding.
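To make the idea concrete, here is a minimal base R sketch of what indicator (one-hot) columns look like. The data frame `d` is made up for illustration, and `model.matrix` is used only to show the target representation; it is not the approach this lesson goes on to use.

```r
# Hypothetical toy data: one categorical column with three levels.
d <- data.frame(x = factor(c("one", "two", "three", "one")))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
indicators <- model.matrix(~ x - 1, data = d)
print(indicators)
```

Each row has exactly one indicator set to 1, which is where the name "one-hot" comes from.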
3. One-Hot-Encoding and Data Cleaning with vtreat
We will use the vtreat package to one-hot encode categorical variables. As a side effect, vtreat also cleans up missing values in both categorical and numeric data.
The basic idea is to design a treatment plan from the training data using the function designTreatmentsZ. This treatment plan records the steps needed to safely one-hot-encode not just the training data, but future data as well.
The function prepare converts the training and future data to a form that is compatible with xgboost: all numerical variables, with no missing values.
4. A Small vtreat Example
Let's work a small example. Here we have data where x is a categorical input variable with levels "one", "two" and "three", and u is a numeric input variable. y is the outcome.
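A data frame of this shape could be built as follows. The values are invented for illustration and are an assumption; only the column names and the three levels of x come from the lesson.

```r
# Illustrative values only; the lesson's actual data is not reproduced here.
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),  # categorical input
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),       # numeric input
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),  # outcome
  stringsAsFactors = FALSE
)
str(dframe)
```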
5. Create the Treatment Plan
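First create the treatment plan. designTreatmentsZ takes as input the training data and a vector of the input variable names. The outcome is not included in that vector, because this treatment plan is designed without looking at the outcome.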
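A minimal sketch of this step, assuming the vtreat package is installed. The data values and the names `dframe`, `vars` and `treatplan` are illustrative assumptions.

```r
library(vtreat)

# Illustrative training data (values are made up).
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),
  stringsAsFactors = FALSE
)

# Names of the input variables only; the outcome y is left out.
vars <- c("x", "u")

# Design the treatment plan from the training data.
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)
```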
6. Get the New Variables
The treatment plan contains a member called scoreFrame (the score frame), which maps the original variable names to the new variable names and records the type of each new variable. We are only interested in "lev" variables, which are the indicator variables, and "clean" variables, which are the numeric variables, cleaned so that they hold no bad values like NA or NaN. We can get the list of new variable names from the score frame.
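One way to pull out those names, sketched in base R. The data and object names are illustrative assumptions, and the exact spelling of the generated variable names can differ between vtreat versions, so no particular names are assumed here.

```r
library(vtreat)

# Illustrative training data (values are made up).
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),
  stringsAsFactors = FALSE
)
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)

# The score frame maps original variables to new variables and their types.
sf <- treatplan$scoreFrame
print(sf[, c("varName", "origName", "code")])

# Keep only the indicator ("lev") and cleaned numeric ("clean") variables.
newvars <- as.character(sf$varName[sf$code %in% c("clean", "lev")])
print(newvars)
```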
7. Prepare the Training Data for Modeling
Once we have the variable names, we can use the prepare function and the treatment plan to create new treated training data suitable for training an xgboost model.
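The call might look like the sketch below. The data and the names `dframe.treat` and `newvars` are illustrative assumptions; `varRestriction` limits the output to the selected variables.

```r
library(vtreat)

# Illustrative training data (values are made up).
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),
  stringsAsFactors = FALSE
)
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)
sf <- treatplan$scoreFrame
newvars <- as.character(sf$varName[sf$code %in% c("clean", "lev")])

# Apply the plan to the training data, keeping only the selected variables.
dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars)
str(dframe.treat)
```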
8. Before and After Data Treatment
The categorical variable is now three indicator variables, and the treated data is all numerical, with no missing values. Note that the outcome variable is not present in the treated data.
9. Prepare the Test Data Before Model Application
We must also use prepare on future data before model application.
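A sketch of treating future data with the same plan. The test values, including the deliberately missing value in u, are made up for illustration.

```r
library(vtreat)

# Illustrative training data (values are made up).
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),
  stringsAsFactors = FALSE
)
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)
sf <- treatplan$scoreFrame
newvars <- as.character(sf$varName[sf$code %in% c("clean", "lev")])

# Hypothetical future data, with one missing numeric value.
testframe <- data.frame(
  x = c("one", "three", "two"),
  u = c(30, NA, 47),
  stringsAsFactors = FALSE
)

# Reuse the plan designed on the training data; do not design a new one.
testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars)
print(testframe.treat)
```

The point of the design is that the plan is built once, from training data, and then applied unchanged to anything the model will later score.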
10. vtreat Treatment is Robust
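Sometimes a categorical variable takes levels in future data that did not appear in the training data. Many models will fail with an error when asked to predict on a level they have never seen. vtreat encodings handle this situation gracefully: prepare still returns a complete, all-numeric row for the unseen level.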
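The sketch below feeds the plan a level, "four", that was never seen in training. The data and names are illustrative assumptions. The row for the unseen level is expected to come back with zeros in every indicator column, but inspect the printed result in your own vtreat version rather than relying on that.

```r
library(vtreat)

# Illustrative training data (values are made up); x has three levels.
dframe <- data.frame(
  x = rep(c("one", "two", "three"), length.out = 10),
  u = c(44, 24, 66, 22, 35, 51, 14, 39, 27, 58),
  y = c(0.43, 1.50, 0.27, 0.35, 1.63, 0.02, 0.41, 1.46, 0.18, 0.39),
  stringsAsFactors = FALSE
)
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)
sf <- treatplan$scoreFrame
newvars <- as.character(sf$varName[sf$code %in% c("clean", "lev")])

# Hypothetical future data containing a level not present in training.
newframe <- data.frame(
  x = c("one", "four"),
  u = c(30, 41),
  stringsAsFactors = FALSE
)

# prepare does not fail on the unseen level.
newframe.treat <- prepare(treatplan, newframe, varRestriction = newvars)
print(newframe.treat)
```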
vtreat can also do a meaningful encoding of categorical variables similar to the method ranger uses, but that is outside the scope of this lesson. For more information, see the vtreat documentation on GitHub.
Now let’s practice using vtreat on data.
11. Let's practice!