Exercise

Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values: as you saw previously, the column LotFrontage has many missing values. Then, you will need to convert any categorical columns in the dataset to numeric form using one-hot encoding. For a refresher on the idea, see the video from Supervised Learning with scikit-learn.

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that converts the values in a categorical column into integers. You'll practice using it here.
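To see what LabelEncoder does before applying it to the housing data, here is a minimal sketch on a handful of made-up zoning codes. The values below are illustrative only and are not taken from the exercise dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; not read from the housing data
zoning = ["RL", "RM", "FV", "RL"]

le = LabelEncoder()

# fit_transform learns the distinct categories (sorted) and
# returns the integer code assigned to each value
encoded = le.fit_transform(zoning)

print(le.classes_)  # the learned categories, in sorted order
print(encoded)      # one integer per input value
```

Each distinct string is mapped to its position in the sorted list of categories, so repeated values always receive the same integer.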

Instructions

  • Import LabelEncoder from sklearn.preprocessing.
  • Fill in missing values in the LotFrontage column with 0 using .fillna().
  • Create a boolean mask for categorical columns. You can do this by checking whether df.dtypes equals object.
  • Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
  • Encode all of the categorical columns into integers using the LabelEncoder object, le. To do this, use the .fit_transform() method of le in the provided lambda function.
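The steps above can be sketched end to end as follows. This is one possible approach, not the exercise's official solution. The small DataFrame is a hypothetical stand-in for the housing data, and the names categorical_mask and categorical_columns are assumptions chosen for illustration; only df and le come from the instructions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the housing data (made-up rows).
# The string columns are given object dtype explicitly, because some
# pandas versions may infer a dedicated string dtype instead.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MSZoning": pd.Series(["RL", "RM", "RL", "FV"], dtype=object),
    "PavedDrive": pd.Series(["Y", "N", "Y", "P"], dtype=object),
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Fill missing values in LotFrontage with 0
df["LotFrontage"] = df["LotFrontage"].fillna(0)

# Boolean mask: True for columns whose dtype is object
categorical_mask = (df.dtypes == object)

# Names of the categorical columns selected by the mask
categorical_columns = df.columns[categorical_mask].tolist()

# Instantiate the encoder like any other scikit-learn estimator
le = LabelEncoder()

# Refit the encoder on each categorical column and replace its values
df[categorical_columns] = df[categorical_columns].apply(
    lambda x: le.fit_transform(x)
)

print(df[categorical_columns].head())
```

Because le is refit on every column, it only remembers the mapping for the last column it processed. If the original category names are needed later, one option is to keep a separate encoder per column.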