1. Categorical features
We started working with numerical features in the previous lesson. In this lesson, we will generate some new features from categorical variables.
2. Label encoding
Consider the example of a categorical feature on the slide. The majority of machine learning models do not handle string values and categorical features automatically. So, before passing the data to the model, we need to pre-process the categorical features into meaningful numbers. There are lots of ways to encode categorical features. We'll consider two of the most popular options.
The first one is label encoding. The idea is to map each category to an integer. In this case, A is mapped to 0, B is mapped to 1, and so on.
3. Label encoding
To apply label encoding, we will use LabelEncoder from sklearn.
First, create an object of this class.
Then call the fit_transform() method on the column to be encoded. df is an example DataFrame from the previous slide.
So, now we have label encoded categories!
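Here is a minimal sketch of these steps, assuming an example DataFrame with a single column named 'cat' (the column name and values are placeholders, since the slide's exact data is not shown here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder data standing in for the example DataFrame on the slide
df = pd.DataFrame({'cat': ['A', 'B', 'C', 'D', 'A']})

# Create the LabelEncoder object
le = LabelEncoder()

# Fit on the column and transform its values into integer labels
df['cat_encoded'] = le.fit_transform(df['cat'])

print(df)
#   cat  cat_encoded
# 0   A            0
# 1   B            1
# 2   C            2
# 3   D            3
# 4   A            0
```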
The problem with label encoding is that we implicitly assume a ranking between the categories. For example, category C has label 2, which is much higher than category A with label 0. Such an approach can be harmful to linear models, although it still works for tree-based models.
4. One-Hot encoding
To overcome the problem of ranking dependency between the categories, we could use one-hot encoding. In this type of encoding, we create a separate column for each of the categories.
So, in this example, we created 4 columns instead of a single initial one. Then we set 1 for the corresponding category value and 0 for all other categories.
5. One-Hot encoding
There are multiple ways to implement one-hot encoding. We will consider pandas' get_dummies() function.
Let's call it on the column to be encoded, specifying the prefix parameter, which sets the names of the new columns.
Then we drop the initial categorical column, because it is not needed anymore.
Lastly, we concatenate the original features with the one-hot encoded columns into a single DataFrame.
The resulting DataFrame has the expected structure.
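A minimal sketch of this workflow is shown below. The column names 'cat' and 'num' and the prefix value are placeholders for the example on the slide:

```python
import pandas as pd

# Placeholder data with one numerical and one categorical column
df = pd.DataFrame({'num': [10, 20, 30, 40],
                   'cat': ['A', 'B', 'C', 'D']})

# Create one column per category; prefix sets the new column names
ohe = pd.get_dummies(df['cat'], prefix='cat')

# Drop the initial categorical column, as it is no longer needed
df = df.drop('cat', axis=1)

# Concatenate the original features with the one-hot encoded columns
df = pd.concat([df, ohe], axis=1)

print(df)
#    num  cat_A  cat_B  cat_C  cat_D
# 0   10      1      0      0      0
# 1   20      0      1      0      0
# 2   30      0      0      1      0
# 3   40      0      0      0      1
# (recent pandas versions may display the dummy columns as True/False booleans)
```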
The drawback of this approach arises if the feature has a lot of different categories. For example, if we have a feature with 1,000 different categories, we'll have to create 1,000 new columns.
6. Binary Features
One special case of categorical features is binary features: categorical variables that have only two possible values.
For example, Yes-No answers or whether some property is On or Off.
For such features, we always apply label encoding, replacing one category with zero and the other with one.
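For instance, a Yes-No answer column could be encoded like this (the column name 'answer' is a placeholder chosen for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder binary feature with two possible values
df = pd.DataFrame({'answer': ['Yes', 'No', 'No', 'Yes']})

# LabelEncoder maps the two categories to 0 and 1
le = LabelEncoder()
df['answer_encoded'] = le.fit_transform(df['answer'])

print(df)
#   answer  answer_encoded
# 0    Yes               1
# 1     No               0
# 2     No               0
# 3    Yes               1
```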
7. Other encoding approaches
There is a long list of other categorical features encoders.
8. Other encoding approaches
The most widely used on Kaggle is the target encoder. We will learn more about it in the next lesson.
9. Let's practice!
But for now, let's get some practical experience with label and one-hot encoders!