1. Nominal predictors
In the previous lesson we focused on preprocessing numeric variables. We learned how to center and scale our data with step_normalize() and how to remove highly correlated variables with step_corr().
In this lesson we will learn how to train a recipe to process nominal predictor variables.
2. Nominal data
Nominal data values identify characteristics or groups.
Think of them as a set of categories with no meaningful order.
Some examples of nominal data include a department within a company, a person's native language, or the type of car you drive. In all of these examples, the values serve as labels for a particular category or group.
3. Transforming nominal predictors
Nominal data must be transformed to numeric data during feature engineering because many machine learning algorithms require numeric input.
One-hot encoding is a transformation that maps the distinct values of a nominal variable to a sequence of 0/1 indicator variables.
Each unique value gets its own indicator variable.
Suppose we have a nominal variable that records the department in which employees work at a company. This variable has three unique values: Finance, Marketing, and Technology. One-hot encoding creates a sequence of three indicator variables for this data. Notice how each indicator variable has a 1 in the row that matches the category in the original data. Since every row of the one-hot encoded results must sum to 1, having a column for every unique value is redundant. For example, if department_marketing and department_technology are both equal to 0, then we know the department value must be Finance.
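As a rough sketch of how this looks in code (the toy departments data frame below is hypothetical, not part of the lesson's data), step_dummy() with one_hot = TRUE keeps an indicator column for every level:

```r
library(recipes)

# Hypothetical toy data with the three department values from the example
departments <- data.frame(
  department = factor(c("Finance", "Marketing", "Technology"))
)

# one_hot = TRUE creates one indicator column per unique value
recipe(~ department, data = departments) %>%
  step_dummy(department, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)
# Produces department_Finance, department_Marketing, department_Technology
```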
4. Transforming nominal predictors
Dummy variable encoding takes a different approach and removes this redundancy by excluding one of the distinct values.
If we have n distinct values in our categorical data, we will get n - 1 indicator variables.
This is the preferred method for tidymodels and is the default in the recipes package.
With this method our department variable is mapped to a sequence of two indicator variables.
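Continuing the toy sketch from above, the default one_hot = FALSE drops the first level, so three departments become two indicator columns:

```r
# Default dummy encoding: the first level (Finance) is dropped,
# leaving n - 1 = 2 indicator columns
recipe(~ department, data = departments) %>%
  step_dummy(department) %>%
  prep() %>%
  bake(new_data = NULL)
# Produces department_Marketing and department_Technology;
# a row of all zeros means the department was Finance
```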
5. Lead scoring data
In our lead scoring data, lead_source and us_location are nominal predictor variables.
6. Creating dummy variables
To transform these variables, we start by specifying a recipe with our model formula and leads_training data.
Then we pass this to the step_dummy function where we select the lead_source and us_location variables for processing.
We then pass the results to the prep function where the recipe is trained on the leads_training data.
Finally, this is passed into the bake function, where we apply the transformation to our leads_test data.
The results show that dummy variables have been created for both variables in the test data.
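Assembled, the pipeline from this slide looks roughly like the following; the exact model formula isn't shown here, so purchased ~ . is an assumption:

```r
library(recipes)

# Specify a recipe with the model formula and the training data
leads_rec <- recipe(purchased ~ ., data = leads_training) %>%
  # Create dummy variables for the two nominal predictors
  step_dummy(lead_source, us_location)

# Train the recipe on leads_training, then apply it to leads_test
leads_rec %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
```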
7. Selecting columns by type
A more robust way to perform feature engineering is to select columns by type. This is done by passing a comma-separated sequence of selector functions to a step function in the recipe.
To select all factor or character columns in a data frame, we can use the all_nominal() selector function. To exclude the outcome variable, purchased, we use the all_outcomes() selector preceded by a minus sign.
This code produces the same results, but it is more resilient: if our variable names change in the future, the code will still run without errors.
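A sketch of the type-based version, under the same assumed formula:

```r
# Select every factor or character column, excluding the outcome
recipe(purchased ~ ., data = leads_training) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
```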
8. Preprocessing nominal predictor variables
Many modeling engines in R include automatic dummy variable creation, so it is possible to fit models without having to use step_dummy().
However, engines are not consistent in whether they use one-hot or dummy encoding, or in how they name the new columns.
Using the recipes package standardizes this process and will make your code less susceptible to errors.
9. Let's practice!
Let's practice by adding step_dummy() to our feature engineering pipeline for the telecommunications data!