LoslegenKostenlos loslegen

Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values - as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea.

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder function that converts the values in each categorical column into integers. You'll practice using this here.

Diese Übung ist Teil des Kurses

Extreme Gradient Boosting with XGBoost

Kurs anzeigen

Anleitung zur Übung

  • Import LabelEncoder from sklearn.preprocessing.
  • Fill in missing values in the LotFrontage column with 0 using .fillna().
  • Create a boolean mask for categorical columns. You can do this by checking for whether df.dtypes equals object.
  • Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
  • Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.

Interaktive Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

# Import LabelEncoder
____

# Fill missing values with 0
df.LotFrontage = ____

# Create a boolean mask for categorical columns
categorical_mask = (____ == ____)

# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()

# Print the head of the categorical columns
print(df[categorical_columns].head())

# Create LabelEncoder object: le
le = ____

# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: ____(x))

# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
Code bearbeiten und ausführen