Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values - as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea.

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder function that converts the values in each categorical column into integers. You'll practice using this here.

Import LabelEncoder from sklearn.preprocessing.
Fill in missing values in the LotFrontage column with 0 using .fillna().
Create a boolean mask for categorical columns. You can do this by checking for whether df.dtypes equals object.
Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.

script.py

IPython Shell

Classification with XGBoost

Regression with XGBoost

Fine-tuning your XGBoost model

Using XGBoost in pipelines

Exercise

Exercise

Encoding categorical columns I: LabelEncoder

Instructions