Encoding categorical columns II: OneHotEncoder

Okay - so you have your categorical columns encoded numerically. Can you now move onto using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.

As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables. You can do this using scikit-learn's OneHotEncoder.

Este exercício faz parte do curso

Extreme Gradient Boosting with XGBoost

Ver curso

Instruções do exercício

Import OneHotEncoder from sklearn.preprocessing.
Instantiate a OneHotEncoder object called ohe. Specify the keyword argument sparse=False.
Using its .fit_transform() method, apply the OneHotEncoder to df and save the result as df_encoded. The output will be a NumPy array.
Print the first 5 rows of df_encoded, and then the shape of df as well as df_encoded to compare the difference.

Exercício interativo prático

Experimente este exercício completando este código de exemplo.

# Import OneHotEncoder
____

# Create OneHotEncoder: ohe
ohe = ____

# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ____

# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])

# Print the shape of the original DataFrame
print(df.shape)

# Print the shape of the transformed array
print(df_encoded.shape)

Editar e executar o código