1. Review of pipelines using sklearn
Let's begin the final chapter in this course by reviewing how pipelines are used in scikit-learn. Refreshing our memory about how pipelines work will allow us to use XGBoost effectively in pipelines going forward. Before working through an example script using pipelines, let's briefly go over how they work.
2. Pipeline review
Pipelines in sklearn are objects that take a list of named tuples as input. Each named tuple must contain a string name as its first element and any scikit-learn compatible transformer or estimator object as its second element. Each named tuple in the pipeline is called a step, and the steps contained in the list are executed in order once data is passed through the pipeline.
This is done using scikit-learn's standard fit/predict paradigm. Finally, where pipelines are really useful is that they can themselves be used as estimator objects within other scikit-learn objects, the most useful of which are the cross_val_score function, which allows for efficient cross-validation and out-of-sample metric calculation, and the grid search and random search approaches for tuning hyperparameters.
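The pattern described above can be sketched in a few lines. This is a minimal illustration, not code from the course; the synthetic data and the LinearRegression final step are stand-ins chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Synthetic data, just to make the sketch runnable
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# A pipeline is a list of (name, step) tuples: a string name first,
# then a transformer or estimator object
pipeline = Pipeline([
    ("scaler", StandardScaler()),   # step 1: a transformer
    ("model", LinearRegression()),  # final step: an estimator
])

pipeline.fit(X, y)           # scales X, then fits the regression
preds = pipeline.predict(X)  # applies the same scaling, then predicts
```

Because the pipeline exposes fit and predict itself, it can be dropped anywhere a plain estimator is expected, which is what makes it usable inside cross_val_score and the search objects.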
3. Scikit-learn pipeline example
Now that we've talked about how pipelines work, let's see them in action. In this example, we will use the Boston Housing dataset.
As you've seen many times before, we first import all of the functionality we will need for the example. We will use a RandomForestRegressor model to predict housing prices, and will import Pipeline from sklearn's pipeline submodule.
In lines 2-4, we load in our data and create our X feature matrix and y target vector.
Lines 5-6 are the ones that do the real work here. In line 5, we create our pipeline, which contains a StandardScaler transformer followed by our RandomForestRegressor estimator. Line 6 takes the just-created pipeline estimator as an input along with our X matrix and y vector, performs 10-fold cross-validation using the pipeline and the data, and outputs the neg_mean_squared_error as an evaluation metric once per fold.
As a brief aside, neg_mean_squared_error is scikit-learn's API-specific way of calculating the mean squared error: because the API assumes that higher scores are better, the error is negated so that it can be maximized. Negative mean squared errors don't actually exist, as squares of real numbers are always non-negative.
4. Scikit-learn pipeline example
Thus, in lines 7 and 8 we simply take the absolute value of the scores, take each of their square roots, and compute their mean to get a root mean squared error across all 10 cross-validation folds.
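A reconstruction of the script walked through above is sketched below. One caveat: the original used the Boston Housing data, but load_boston was removed in scikit-learn 1.2, so a synthetic regression dataset from make_regression stands in for it here (which is why the resulting RMSE will not match the roughly 4.5 quoted in the video).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the Boston Housing X feature matrix and y target vector
X, y = make_regression(n_samples=500, n_features=13,
                       noise=10.0, random_state=42)

# StandardScaler transformer followed by a RandomForestRegressor estimator
rf_pipeline = Pipeline([
    ("st_scaler", StandardScaler()),
    ("rf_model", RandomForestRegressor(random_state=42)),
])

# 10-fold cross-validation; one negated MSE score per fold
scores = cross_val_score(rf_pipeline, X, y,
                         scoring="neg_mean_squared_error", cv=10)

# Take absolute values, square roots, and the mean to get a
# cross-validated RMSE
final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print("Final RMSE:", final_avg_rmse)
```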
We can see that on average our prediction was off by about 4-point-5 units. In the following exercises, because we will be working with the Ames housing dataset, which is more complex than the Boston housing dataset,
5. Preprocessing I: LabelEncoder and OneHotEncoder
some additional preprocessing steps will be required.
Specifically, we will do the same preprocessing steps in two different ways, only one of which can be done within a pipeline.
The first approach involves using the LabelEncoder and OneHotEncoder classes of scikit-learn’s preprocessing submodule one after the other.
LabelEncoder simply converts a categorical column of strings into integers that map onto those strings.
OneHotEncoder takes a column of integers that are treated as categorical values, and encodes them as dummy variables, which you may already be familiar with.
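The two-step encoding can be sketched on a toy column of strings; the column values here are invented for illustration and are not from the Ames data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A toy categorical column of strings
colors = np.array(["red", "green", "blue", "green", "red"])

# Step 1: LabelEncoder maps each string to an integer
# (classes are ordered alphabetically: blue=0, green=1, red=2)
le = LabelEncoder()
color_ints = le.fit_transform(colors)
print(color_ints)  # [2 1 0 1 2]

# Step 2: OneHotEncoder turns those integers into dummy columns,
# one column per category
ohe = OneHotEncoder()
color_dummies = ohe.fit_transform(color_ints.reshape(-1, 1)).toarray()
print(color_dummies)
```

Note that each step works on one representation at a time, which is part of why chaining them inside a single pipeline step is awkward.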
The problem with this 2-step method, however, is that it cannot currently be done within a pipeline. But not all hope is lost. The second approach,
6. Preprocessing II: DictVectorizer
which involves using a DictVectorizer, can accomplish both steps in one line of code. The DictVectorizer is a class found in scikit-learn's feature extraction submodule, and is traditionally used in text processing pipelines to convert lists of feature mappings into vectors. Using pandas DataFrames, we don't initially have such a list; however, if we explicitly convert a DataFrame into a list of dictionary entries, then we have exactly what we need. For more details on these classes, I encourage you to explore the scikit-learn documentation.
7. Let's build pipelines!
You will use both approaches in the next few exercises. I hope you have fun building pipelines!