Develop machine learning pipeline

1. Develop machine learning pipeline

Let's now look at combining the steps of scaling the data and estimating the model.

2. Pipeline

For this, we will use another concept from scikit-learn, called Pipelines. Pipelines are, in my opinion, the best feature scikit-learn has to offer, since they allow us to define a list of actions to be applied to the data in sequence, during training, testing, and use of the model alike. A pipeline applies a list of transformations and one final estimator to the data. Examples of transformations are data conversions or data scalers, like the one we saw in the previous lesson. An estimator is a model that predicts an output. We've seen an estimator before: LogisticRegression. There are many others, but we won't cover them in this course.

3. Create a Pipeline

We start by importing the required modules. We've seen StandardScaler and LogisticRegression before, and we'll also need to import Pipeline from sklearn.pipeline. We then initialize a StandardScaler and a LogisticRegression object and store them in the variables sc and logreg, respectively. Next, we create our pipeline. Pipeline takes a list of steps as tuples. The first element of each tuple is the name of the step; you can pick any names, but they must be unique within a Pipeline and should make clear what each step does. We use scale and logreg in this case. The second element of each tuple is the StandardScaler or LogisticRegression object, respectively.
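Putting that together, a minimal sketch of these steps (sc, logreg, and the step names follow the narration; the variable name pipe is an assumption):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Initialize the scaler and the estimator
sc = StandardScaler()
logreg = LogisticRegression()

# Pipeline takes a list of (name, step) tuples;
# names must be unique within the pipeline
pipe = Pipeline([("scale", sc), ("logreg", logreg)])
```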

4. Inspect Pipeline

If we print the pipeline object to the screen, we can inspect the steps we configured for it. As expected, we have the StandardScaler under the name "scale" and the LogisticRegression estimator under the name "logreg". The steps are executed in the sequence they are passed in, so scaling will run before the logistic regression. The pipeline object provides the same methods we used on the LogisticRegression object before, so we can now fit the pipeline to the data X_train and y_train by calling .fit(). This performs a combined fit of both the StandardScaler and the LogisticRegression. As with the logistic regression alone, we can also predict values with the pipeline by calling .predict(). The data first runs through the scaler before being classified by the logistic regression.
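Continuing the sketch (X_train and y_train come from the course's data split; X_test is an assumed held-out set):

```python
# Print the configured steps; scaling runs before classification
print(pipe)

# Fit the scaler and the estimator in one combined call
pipe.fit(X_train, y_train)

# New data is scaled first, then classified by the fitted estimator
y_pred = pipe.predict(X_test)
```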

5. Save model

Another important step in applying machine learning models to data streams is to save and reuse the trained model. Scikit-learn itself does not provide a way to save a trained model; however, its documentation recommends using Python's pickle module for this task. We first import pickle. Then we create a Path object with the filename, open the file in binary write mode, or "wb", and pass the file object as the second argument to pickle.dump(). The first argument is the model or pipeline to store. Pickle writes the trained model to disk as a binary file.
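A sketch of saving the fitted pipeline (the filename model.pkl is a placeholder):

```python
import pickle
from pathlib import Path

# Open the file in binary write mode ("wb") and dump the trained pipeline;
# "model.pkl" is a placeholder filename
with Path("model.pkl").open("wb") as f:
    pickle.dump(pipe, f)
```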

6. Load model

Just as we wrote the model to disk, we can read a stored model back with pickle.load(). Since we stored a trained model, it's ready to predict values immediately after being loaded. A note of caution: you should not unpickle untrusted files, as this can lead to malicious code being executed on loading. It is safe, however, if you know and trust the file.
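And the matching load step (same placeholder filename; X_test is assumed as before):

```python
import pickle
from pathlib import Path

# Only unpickle files you trust: loading can execute arbitrary code
with Path("model.pkl").open("rb") as f:
    pipe = pickle.load(f)

# The loaded pipeline is already trained, so it can predict right away
y_pred = pipe.predict(X_test)
```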

7. Let's practice!

And now it's your turn to build some Pipelines!