Encoding categorical columns I: LabelEncoder
Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.
First, you will need to fill in missing values: as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea.
The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that converts the values in each categorical column into integers. You'll practice using this here.
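As a rough illustration of what LabelEncoder does to a single column, here is a minimal sketch. The values in zoning are made up for this example and are not taken from the housing data. LabelEncoder sorts the unique labels and assigns each one its position in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, not taken from the housing data
zoning = ["RL", "RM", "RL", "FV", "RM"]

# Learn the unique labels and map each value to an integer code
le = LabelEncoder()
codes = le.fit_transform(zoning)

print(list(le.classes_))  # ['FV', 'RL', 'RM'] (sorted unique labels)
print(codes.tolist())     # [1, 2, 1, 0, 2]
```

Note that the integer codes carry an ordering (FV < RL < RM) that has no real meaning for these categories.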
This exercise is part of the course
Extreme Gradient Boosting with XGBoost
Instructions
- Import LabelEncoder from sklearn.preprocessing.
- Fill in missing values in the LotFrontage column with 0 using .fillna().
- Create a boolean mask for categorical columns. You can do this by checking whether df.dtypes equals object.
- Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
- Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.
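The steps above can be sketched on a small stand-in DataFrame. Everything in toy is hypothetical: the exercise's df is preloaded and is not reproduced here. The string columns are created with dtype=object on the assumption that this matches how df stores them; some newer pandas versions may otherwise infer a dedicated string dtype, in which case the == object check would not match.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the housing data (three rows, three columns)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MSZoning": pd.Series(["RL", "RM", "RL"], dtype=object),
    "PavedDrive": pd.Series(["Y", "N", "Y"], dtype=object),
})

# Fill missing values with 0
toy["LotFrontage"] = toy["LotFrontage"].fillna(0)

# Boolean mask: True for columns whose dtype is object
categorical_mask = (toy.dtypes == object)
categorical_columns = toy.columns[categorical_mask].tolist()

# One encoder object, refit on each column inside the lambda
le = LabelEncoder()
toy[categorical_columns] = toy[categorical_columns].apply(
    lambda x: le.fit_transform(x)
)

print(toy)
```

Because the same le is refit on every column, it only remembers the labels of the last column it saw. That is fine for a one-off conversion like this, but a separate encoder per column would be needed to reverse the mapping later.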
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import LabelEncoder
____
# Fill missing values with 0
df.LotFrontage = ____
# Create a boolean mask for categorical columns
categorical_mask = (____ == ____)
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
# Create LabelEncoder object: le
le = ____
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: ____(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())