1. Target encoding
We now come to one of the secret sauces of Kaggle competitions: target encoding.
2. High cardinality categorical features
To begin with, let's discuss high cardinality categorical features. These are categorical features with a large number of distinct category values (say, more than 10).
A label encoder would encode each category with a separate number. With high cardinality, this leaves us with a feature containing many unordered integer values.
Another option is a one-hot encoder. In this case, we have to create a new feature for each category value, which produces a very large number of columns.
So, the best alternative to the two methods above is target encoding. Like a label encoder, it creates only a single column, but it also captures the correlation between the categories and the target variable.
3. Mean target encoding
There are various options for the encoding function. We will consider the one most frequently used on Kaggle: mean target encoding.
Say we have a binary classification problem with a single categorical feature. On the left is our train data with known labels.
On the right is our test data on which we'd like to make predictions.
4. Mean target encoding
To apply mean target encoding to a particular feature we need to perform the following steps:
First, we calculate the mean target value for each category on the whole train data. Then we apply these statistics to the corresponding category in the test data.
Next, we divide the train data into folds. For each fold, we calculate the target mean on all the folds except this particular one; these are called 'out-of-fold' data. The out-of-fold statistics are then applied to the held-out fold. This prevents overfitting to the train set.
Now, both train and test data have this new feature. So, we can add this mean target encoded feature to our model.
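The two-step procedure above can be sketched with pandas and scikit-learn. This is a minimal illustration, not code from any particular library: the function name, arguments, and the choice of 5 folds with a fixed random seed are all assumptions for the sketch.

```python
import pandas as pd
from sklearn.model_selection import KFold

def mean_target_encode(train, test, categorical, target, n_folds=5):
    """Mean target encoding sketch: whole-train means for the test data,
    out-of-fold means for the train data."""
    global_mean = train[target].mean()

    # Step 1: test encoding -- per-category target means on the whole train set.
    train_means = train.groupby(categorical)[target].mean()
    test_feature = test[categorical].map(train_means).fillna(global_mean)

    # Step 2: train encoding -- each fold is encoded with means computed
    # on the other folds only ("out-of-fold"), which prevents overfitting.
    train_feature = pd.Series(index=train.index, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for oof_idx, fold_idx in kf.split(train):
        oof_means = train.iloc[oof_idx].groupby(categorical)[target].mean()
        encoded = train[categorical].iloc[fold_idx].map(oof_means)
        # Categories absent from the out-of-fold data fall back to the global mean.
        train_feature.iloc[fold_idx] = encoded.fillna(global_mean).values
    return train_feature, test_feature
```

Both returned columns can then be attached to the train and test frames and passed to the model.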
5. Calculate mean on the train
To encode categories in the test data, we simply take the whole train data and calculate mean target values for each category.
6. Calculate mean on the train
In this case, for category A it equals 2/3, or about 0.67 (2 positive values out of 3 observations).
7. Calculate mean on the train
And for category B it equals 0.25 (1 positive value out of 4 observations).
8. Test encoding
These statistics are applied to the corresponding category in the test data. As a result, we've obtained a new feature.
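The transcript shows only the per-category counts, so the toy rows below are one dataset consistent with them (category A: 3 observations, 2 positive; category B: 4 observations, 1 positive); the column names are illustrative.

```python
import pandas as pd

# Toy train data consistent with the slide's counts.
train = pd.DataFrame({"feature": list("AAABBBB"),
                      "target":  [1, 1, 0, 0, 1, 0, 0]})
test = pd.DataFrame({"feature": ["A", "B", "A"]})

# Per-category target means on the whole train set.
means = train.groupby("feature")["target"].mean()
print(means["A"])  # 2/3, about 0.67
print(means["B"])  # 0.25

# Apply these statistics to the corresponding categories in the test data.
test["feature_enc"] = test["feature"].map(means)
```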
9. Train encoding using out-of-fold
Now, we need to calculate this mean target encoded feature for the train data. As we said, we'll be using out-of-fold statistics.
Let's split the train data into 2 folds: one and two.
10. Train encoding using out-of-fold
Take fold number 1. We calculate the target mean outside of this fold, that is, using only fold number 2 observations.
11. Train encoding using out-of-fold
As a result, category A gets 0 and category B gets 0.5.
12. Train encoding using out-of-fold
Now we calculate out-of-fold target means for the second fold using only the first fold observations.
13. Train encoding using out-of-fold
Thus, category A gets 1 and category B gets 0.
We now have this mean encoded category in both the train and test data. So, we can use it as a new feature and pass to our model.
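The transcript gives only the resulting out-of-fold means, not the raw rows, so the two folds below are one reconstruction consistent with the slide's numbers; the column names are illustrative.

```python
import pandas as pd

# One possible 2-fold split consistent with the slide's numbers.
fold_1 = pd.DataFrame({"feature": ["A", "A", "B", "B"], "target": [1, 1, 0, 0]})
fold_2 = pd.DataFrame({"feature": ["A", "B", "B"], "target": [0, 1, 0]})

# Encode fold 1 with means computed on fold 2 only ...
oof_means_1 = fold_2.groupby("feature")["target"].mean()
fold_1["feature_enc"] = fold_1["feature"].map(oof_means_1)

# ... and encode fold 2 with means computed on fold 1 only.
oof_means_2 = fold_1.groupby("feature")["target"].mean()
fold_2["feature_enc"] = fold_2["feature"].map(oof_means_2)

print(oof_means_1.to_dict())  # {'A': 0.0, 'B': 0.5}
print(oof_means_2.to_dict())  # {'A': 1.0, 'B': 0.0}
```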
14. Practical guides
Before moving to practice, let's discuss some practical tips that are always applied together with mean target encoding.
15. Practical guides
The first one is smoothing.
Initially, for a specific category, we took a simple mean. However, if we had some rare categories with only one or two observations, they would get an extreme mean encoding of exactly 0 or 1, which could lead to overfitting.
That's why we introduce regularization. We first calculate the global mean: the target mean over the whole train data. Then we act as if we added alpha extra observations with this global mean to each category, so the encoding becomes (n_category * mean_category + alpha * global_mean) / (n_category + alpha). If the category is large, we trust its own mean encoding; otherwise, we stick closer to the global mean.
Alpha is a hyperparameter we have to specify manually. Usually, values from 5 to 10 work pretty well by default.
16. Practical guides
Another practical tip concerns new categories in the test data. For such categories, we do not know the target mean value, so they are simply filled in with the global target mean.
17. Practical guides
Take a look at the example. In the initial setting, category A would get 0.5 and category B would get one third.
However, with alpha equal to 5, category A gets about 0.43 and category B about 0.38, while the new category C in the test data gets the global mean, which equals 0.4.
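These numbers can be checked by hand. The per-category counts below are inferred from the stated means of 0.5 and one third (category A: 2 observations, 1 positive; category B: 3 observations, 1 positive) and are an assumption consistent with the slide.

```python
# Global target mean over the whole train data: 2 positives out of 5 rows.
global_mean = (1 + 1) / (2 + 3)  # 0.4

def smoothed(category_mean, n_category, alpha=5):
    # Smoothing: pretend alpha extra observations with the global mean
    # were added to the category.
    return (category_mean * n_category + global_mean * alpha) / (n_category + alpha)

print(smoothed(1 / 2, 2))  # 3/7, about 0.43
print(smoothed(1 / 3, 3))  # 3/8 = 0.375, about 0.38
# A brand-new category C in the test data simply gets global_mean = 0.4.
```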
18. Let's practice!
All right, let's turn these theoretical considerations into Python code!