1. Encoding categorical data using supervised learning
We have discussed unsupervised encoding, where we assign numeric values to a factor based on its levels alone, without using the outcome.
2. Introducing supervised encoding
Supervised encoding uses the outcome values to derive numeric features from nominal predictors.
3. Introducing supervised encoding
There are many supervised encoding methods available. The embed package provides quite a few, all compatible with the tidymodels framework. We will explore three common linear encoding methods.
step_lencode_glm() uses likelihood encodings to convert a nominal predictor into a single set of scores derived from a generalized linear model.
step_lencode_bayes() applies Bayesian likelihood encodings to convert a nominal predictor into a single set of scores derived from a generalized linear model estimated using Bayesian analysis.
step_lencode_mixed() converts nominal predictors into a single set of scores derived from a generalized linear mixed model.
Let's compare the three in a prediction setting.
4. Predicting grant application success
The grants dataset contains information on successful and unsuccessful grant requests and several predictors. However, we are interested in predicting success based on the sponsor_code alone, which is a factor.
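To get a feel for the task, we can count the outcome and the predictor levels. This is a minimal sketch; it assumes the data load as a tibble named grants with a factor outcome called class, names not confirmed by the video.

```r
library(tidymodels)

# Outcome balance (assumes the outcome column is named `class`)
grants %>% count(class)

# sponsor_code is a factor with many levels, which the encoding
# will collapse into a single numeric score per level
grants %>% count(sponsor_code, sort = TRUE)
```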
We build our usual workflow, declaring logistic regression as our model and adding step_lencode_glm() to the recipe. Notice that, since this is a supervised method, we must indicate the outcome variable wrapped in the function vars().
We can print the complete workflow to see a summary.
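Here is a sketch of that workflow; the split name grants_train and the outcome name class are assumptions, not shown verbatim in the video.

```r
library(tidymodels)
library(embed)

# Recipe: likelihood-encode sponsor_code with a GLM; the outcome
# goes inside vars() because the step is supervised
glm_recipe <- recipe(class ~ sponsor_code, data = grants_train) %>%
  step_lencode_glm(sponsor_code, outcome = vars(class))

# Logistic regression model specification
lr_model <- logistic_reg() %>%
  set_engine("glm")

# Bundle the recipe and model, then print the workflow summary
glm_workflow <- workflow() %>%
  add_recipe(glm_recipe) %>%
  add_model(lr_model)

glm_workflow
```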
The syntax for the Bayes and mixed steps is identical; only the step function changes, as sketched below.
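For reference, swapping encodings only means replacing the step call (same assumed names as above):

```r
# Bayesian likelihood encoding of sponsor_code
bayes_recipe <- recipe(class ~ sponsor_code, data = grants_train) %>%
  step_lencode_bayes(sponsor_code, outcome = vars(class))

# Mixed-model likelihood encoding of sponsor_code
mixed_recipe <- recipe(class ~ sponsor_code, data = grants_train) %>%
  step_lencode_mixed(sponsor_code, outcome = vars(class))
```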
5. Fitting, augmenting, and assessing
We fit the model with the training split and evaluate it on the test data, using accuracy and roc_auc as metrics from our user-defined class_evaluate function.
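A sketch of that sequence. class_evaluate is reconstructed here as a yardstick metric set, and .pred_successful assumes the positive class is the first factor level; both are assumptions.

```r
# Plausible definition of the course's user-defined metric set
class_evaluate <- metric_set(accuracy, roc_auc)

# Fit on the training split
glm_fit <- glm_workflow %>%
  fit(data = grants_train)

# augment() appends .pred_class and class probabilities to the test set
glm_aug <- glm_fit %>%
  augment(new_data = grants_test)

# Evaluate; the probability column name depends on the factor levels
glm_metrics <- glm_aug %>%
  class_evaluate(truth = class,
                 estimate = .pred_class,
                 .pred_successful)

glm_metrics
```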
Our results are alright, considering we relied on only one nominal predictor.
6. Binding models together
We can compare the performance induced by the glm, Bayes, and mixed steps in the recipe by summarizing our results in a tibble.
We first define a vector with model names for identification purposes. Each name is repeated because we have two metrics per model.
We then bind the three models' results by rows, add our model names vector as a column, and discard the .estimator column we don't need.
Finally, we spread the tibble to show the metrics as rows and the models as columns.
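A sketch of those three steps, assuming metric tibbles glm_metrics, bayes_metrics, and mixed_metrics from fits like the one above; pivot_wider() stands in for the older spread().

```r
# Two metrics per model, so each name repeats twice
model_names <- rep(c("glm", "bayes", "mixed"), each = 2)

# Bind by rows, label each row, drop the unneeded .estimator column
all_metrics <- bind_rows(glm_metrics, bayes_metrics, mixed_metrics) %>%
  mutate(model = model_names) %>%
  select(-.estimator)

# Metrics as rows, models as columns
wide_metrics <- all_metrics %>%
  pivot_wider(names_from = model, values_from = .estimate)

wide_metrics
```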
While quite similar, the glm step resulted in higher accuracy at the cost of a slightly lower roc_auc.
7. Visualizing our results
We can visualize our results in a parallel coordinates chart from the GGally package.
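A sketch with GGally's ggparcoord(), using the wide tibble built above (columns .metric, glm, bayes, mixed):

```r
library(GGally)

# One line per metric, traced across the three model columns (2:4),
# colored by the .metric column (column 1)
ggparcoord(wide_metrics,
           columns = 2:4,
           groupColumn = 1,
           scale = "globalminmax") +
  labs(x = "model", y = "estimate", color = "metric")
```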
The graph shows the metrics for each model joined by straight lines, emphasizing change. It is clear that the glm encoding yields the highest accuracy, while the roc_auc trade-off is less pronounced. It is important to note the scale, as the differences are small and might not be significant.
8. Let's practice!
Let's take these ideas for a ride.