Incorporating XGBoost into pipelines

1. Incorporating XGBoost into pipelines

Now that you've had some practice using pipelines in scikit-learn, let's see what it takes to use xgboost within pipelines.

2. Scikit-learn pipeline example with XGBoost

This example is very similar to the one shown in the pipeline review that began this chapter. To get XGBoost to work within a pipeline, all that's really required is that you use XGBoost's scikit-learn API inside a pipeline object. Let's see what that looks like in practice. As always, we first import everything we need for our purposes. We then load the dataset and parse it into the matrix of features X and the target vector y. Here lies the only difference between using a native scikit-learn model and XGBoost: we simply pass an instance of XGBoost's XGBRegressor into the pipeline where a normal scikit-learn estimator would go. The rest of the script is exactly what you've seen in the past. You compute the cross-validated negative MSE using 10-fold cross-validation and then convert the 10 negative MSE values into an average RMSE across all 10 folds. As you can see, without any hyperparameter tuning, the XGBoost model achieved a lower RMSE, of roughly 4-point-03 units, than the random forest model we started the chapter with, whose RMSE was around 4-point-5.
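A minimal sketch of that script follows, assuming a generic housing dataset; the file name "housing_data.csv" and the target column "price" are placeholders, not the chapter's actual data.

    import pandas as pd
    import numpy as np
    import xgboost as xgb
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    # Placeholder file and target names -- substitute your own dataset
    data = pd.read_csv("housing_data.csv")
    X, y = data.drop("price", axis=1), data["price"]

    # The only XGBoost-specific step: XGBRegressor sits where any
    # native scikit-learn estimator normally would
    xgb_pipeline = Pipeline([
        ("st_scaler", StandardScaler()),
        ("xgb_model", xgb.XGBRegressor())
    ])

    # Cross-validated negative MSE over 10 folds...
    scores = cross_val_score(xgb_pipeline, X, y,
                             scoring="neg_mean_squared_error", cv=10)

    # ...converted to an average RMSE across all 10 folds
    avg_rmse = np.mean(np.sqrt(np.abs(scores)))
    print("10-fold RMSE:", avg_rmse)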

3. Additional components introduced for pipelines

We wanted you to see how a simple case of pipelining with XGBoost works. In the final end-to-end example, however, we will take a dataset that requires significantly more wrangling before it can be used with XGBoost and put it through a pipeline as well. As a result, we will have to work with a library that is not part of the standard suite of scikit-learn tools, as well as with parts of pipelines that you may not be familiar with. sklearn_pandas is a separate library that attempts to bridge the gap between working with pandas and working with scikit-learn, as they don't always work seamlessly together. Specifically, sklearn_pandas has a generic class called DataFrameMapper, which allows for easy conversion between scikit-learn-aware objects, or pure numpy arrays, and the DataFrames that are the bread and butter of the pandas library. We will also use some less common parts of scikit-learn to accomplish our goals. Specifically, we will use the SimpleImputer class from scikit-learn's impute submodule, which allows us to fill in missing numerical and categorical values, and the FeatureUnion class found in scikit-learn's pipeline submodule. FeatureUnion lets us combine separate pipeline outputs into a single output, as we would need to do, for example, if we had one set of preprocessing steps to perform on the categorical features of a dataset and a distinct set of preprocessing steps on its numeric features. A small sketch of how these pieces fit together follows below. We introduce these topics all at once, but don't want you to feel overwhelmed: we will walk through what each one is doing and how it can be used properly.
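Here is that sketch. The toy DataFrame and its column names ("rooms" and "zone") are invented for illustration; the pattern itself, one DataFrameMapper per column type joined by a FeatureUnion, is the one we will rely on in the end-to-end example.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import FeatureUnion
    from sklearn_pandas import DataFrameMapper

    # Toy DataFrame with made-up columns: one numeric, one categorical,
    # each containing a missing value
    df = pd.DataFrame({
        "rooms": [3.0, None, 4.0],
        "zone": ["A", "B", None],
    })

    # One mapper per column type: fill numeric gaps with the median...
    numeric_imputation_mapper = DataFrameMapper(
        [(["rooms"], SimpleImputer(strategy="median"))],
        input_df=True, df_out=True
    )

    # ...and categorical gaps with the most frequent value
    categorical_imputation_mapper = DataFrameMapper(
        [(["zone"], SimpleImputer(strategy="most_frequent"))],
        input_df=True, df_out=True
    )

    # FeatureUnion stitches the two preprocessed outputs back together
    # into a single feature matrix
    numeric_categorical_union = FeatureUnion([
        ("num_mapper", numeric_imputation_mapper),
        ("cat_mapper", categorical_imputation_mapper),
    ])

    print(numeric_categorical_union.fit_transform(df))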

4. Let's practice!

In introducing these topics, I hope to give you a glimpse of what real-world data preprocessing often involves. Hopefully, you also saw that it's not particularly difficult to incorporate XGBoost into pipelines. Now it's your turn to practice what you just learned!