Preprocessing within a pipeline
Now that you've seen the individual steps needed to properly process the Ames housing data, let's use the much cleaner and more succinct DictVectorizer approach and put it alongside an XGBRegressor inside of a scikit-learn pipeline.
This exercise is part of the course Extreme Gradient Boosting with XGBoost.
Exercise instructions
- Import DictVectorizer from sklearn.feature_extraction and Pipeline from sklearn.pipeline.
- Fill in any missing values in the LotFrontage column of X with 0.
- Complete the steps of the pipeline with DictVectorizer(sparse=False) for "ohe_onestep" and xgb.XGBRegressor() for "xgb_model".
- Create the pipeline using Pipeline() and steps.
- Fit the Pipeline. Don't forget to convert X into a format that DictVectorizer understands by calling the to_dict("records") method on X.
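To make the "records" format concrete, here is a small pure-Python sketch. The two rows are invented for illustration, and one_hot_records is a hypothetical helper that only imitates what DictVectorizer(sparse=False) does on small inputs (string values become "feature=value" indicator columns, numeric values pass through, and column names are sorted). It is not the library's implementation.

```python
# What X.to_dict("records") returns: a list with one dict per row.
# These two rows are made up; they are not taken from the Ames data.
records = [
    {"LotFrontage": 65.0, "Neighborhood": "CollgCr"},
    {"LotFrontage": 0.0, "Neighborhood": "Veenker"},
]


def one_hot_records(rows):
    """Hypothetical helper imitating DictVectorizer(sparse=False)."""
    # Collect output column names: "key=value" for strings, "key" for numbers.
    names = set()
    for row in rows:
        for key, value in row.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)

    # Build one dense row of floats per input dict.
    matrix = []
    for row in rows:
        encoded = dict.fromkeys(names, 0.0)
        for key, value in row.items():
            if isinstance(value, str):
                encoded[f"{key}={value}"] = 1.0
            else:
                encoded[key] = float(value)
        matrix.append([encoded[name] for name in names])
    return names, matrix


names, matrix = one_hot_records(records)
print(names)
print(matrix)
```

Because the encoding is driven by the dictionaries themselves, categorical and numeric columns are handled in a single step, which is why the pipeline needs only one preprocessing stage before the model.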
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import necessary modules
____
____
# Fill LotFrontage missing values with 0
X.LotFrontage = ____
# Setup the pipeline steps: steps
steps = [("ohe_onestep", ____),
         ("xgb_model", ____)]
# Create the pipeline: xgb_pipeline
xgb_pipeline = ____
# Fit the pipeline
____