Encoding categorical columns II: OneHotEncoder
Okay - so you have your categorical columns encoded numerically. Can you now move on to using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.
As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables. You can do this using scikit-learn's OneHotEncoder.
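To make the difference concrete, here is a minimal sketch on toy data containing only the three neighborhoods named above. Because the toy array is not the full dataset, the integer codes it produces (0, 1, 2) differ from the 5, 6, and 24 quoted in the text; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (assumption): only the three neighborhoods mentioned above
neighborhoods = np.array(["CollgCr", "Veenker", "Crawfor", "CollgCr"])

# LabelEncoder assigns integers in sorted order, which a model may read as a ranking
labels = LabelEncoder().fit_transform(neighborhoods)
print(labels)

# OneHotEncoder expects 2-D input, so reshape to a single column;
# the default output is a sparse matrix, converted here to a dense array
onehot = OneHotEncoder().fit_transform(neighborhoods.reshape(-1, 1)).toarray()
print(onehot)
```

Each row of the one-hot output contains a single 1 in the column for its neighborhood, so no neighborhood is numerically "greater" than another.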
This exercise is part of the course
Extreme Gradient Boosting with XGBoost
Exercise instructions
- Import OneHotEncoder from sklearn.preprocessing.
- Instantiate a OneHotEncoder object called ohe. Specify the keyword argument sparse=False.
- Using its .fit_transform() method, apply the OneHotEncoder to df and save the result as df_encoded. The output will be a NumPy array.
- Print the first 5 rows of df_encoded, and then the shape of df as well as df_encoded to compare the difference.
Interactive exercise
Try this exercise by completing this sample code.
# Import OneHotEncoder
____
# Create OneHotEncoder: ohe
ohe = ____
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ____
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])
# Print the shape of the original DataFrame
print(df.shape)
# Print the shape of the transformed array
print(df_encoded.shape)
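One possible way to fill in the blanks is sketched below. It is not the official solution: the small df built here is a hypothetical stand-in for the exercise's DataFrame, and its column names and values are assumptions. Note also that scikit-learn 1.2 renamed the sparse keyword to sparse_output, so the sketch tries the newer name first and falls back to sparse=False as written in the instructions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the exercise's df (assumption): two categorical columns
df = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "CollgCr", "Veenker", "Crawfor"],
    "BldgType": ["1Fam", "1Fam", "Duplex", "1Fam", "Duplex", "1Fam"],
})

# Create OneHotEncoder: ohe (dense output; the keyword name depends on the version)
try:
    ohe = OneHotEncoder(sparse_output=False)  # scikit-learn 1.2 and later
except TypeError:
    ohe = OneHotEncoder(sparse=False)  # earlier releases, as in the instructions

# Apply OneHotEncoder to the categorical columns: the result is a NumPy array
df_encoded = ohe.fit_transform(df)

# Print the first 5 rows, then compare the shapes before and after encoding
print(df_encoded[:5, :])
print(df.shape)
print(df_encoded.shape)
```

On this toy data the number of columns grows from 2 to 5, one column per distinct category, which is the difference the shape comparison is meant to show.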