Encoding categorical columns I: LabelEncoder
Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.
First, you will need to fill in missing values: as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are represented numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea.
The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that converts the values in a categorical column into integers. You'll practice using it here.
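To make the integer mapping concrete, here is a minimal sketch on a few made-up zoning codes (the values are illustrative and not taken from the housing data). It shows that each value is replaced by its position among the sorted unique labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for one categorical column (illustrative only)
zoning = ["RL", "RM", "FV", "RL"]

le = LabelEncoder()
codes = le.fit_transform(zoning)

# classes_ holds the sorted unique labels; each value becomes its index in that array
print(le.classes_.tolist())  # ['FV', 'RL', 'RM']
print(codes.tolist())        # [1, 2, 0, 1]
```

Because the integers follow alphabetical order, they carry no meaning of their own, which is why a one-hot encoding step usually follows.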
This exercise is part of the course
Extreme Gradient Boosting with XGBoost
Instructions
- Import LabelEncoder from sklearn.preprocessing.
- Fill in missing values in the LotFrontage column with 0 using .fillna().
- Create a boolean mask for categorical columns. You can do this by checking whether df.dtypes equals object.
- Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
- Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.
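The missing-value and mask steps above can be sketched on a tiny hypothetical frame. The column names mirror the exercise, but the values are invented, and the string columns are forced to the object dtype explicitly, since some pandas versions may store strings with a dedicated string dtype instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing data; values are made up for illustration
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MSZoning": pd.Series(["RL", "RM", "RL"], dtype=object),
    "PavedDrive": pd.Series(["Y", "N", "Y"], dtype=object),
})

# Replace missing LotFrontage values with 0
df["LotFrontage"] = df["LotFrontage"].fillna(0)

# True for every column stored with the generic object dtype (strings here)
categorical_mask = (df.dtypes == object)

# Use the mask to pick out the matching column names
categorical_columns = df.columns[categorical_mask].tolist()
print(categorical_columns)  # ['MSZoning', 'PavedDrive']
```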
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import LabelEncoder
____
# Fill missing values with 0
df.LotFrontage = ____
# Create a boolean mask for categorical columns
categorical_mask = (____ == ____)
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
# Create LabelEncoder object: le
le = ____
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: ____(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
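For reference, here is a rough sketch of how the column-wise encoding step behaves, run on an invented two-column frame rather than the real housing data. The name encoded is an assumption introduced for illustration; the exercise itself writes the result back into df.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for two of the categorical columns
df = pd.DataFrame({
    "MSZoning": pd.Series(["RL", "RM", "RL"], dtype=object),
    "BldgType": pd.Series(["1Fam", "Duplex", "1Fam"], dtype=object),
})
categorical_columns = df.columns[df.dtypes == object].tolist()

le = LabelEncoder()

# apply hands each column to the lambda in turn; le is refit on every column
encoded = df[categorical_columns].apply(lambda x: le.fit_transform(x))
print(encoded)

# After apply finishes, le only remembers the classes of the last column it saw
print(le.classes_.tolist())  # ['1Fam', 'Duplex']
```

Since a single encoder is reused, the mapping for earlier columns is not retained. If you need to invert the encoding later, one option is to keep a separate LabelEncoder per column.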