
Supervised learning pipelines

1. Supervised learning pipelines

Hi and welcome! This course is designed to give you four machine learning superpowers. To explain what these are, we first need to make sure you are up to speed with the basics. So, get ready for a quick review of machine learning pipelines!

2. Labeled data

We will mostly talk about classification in this course. Classifiers try to identify the classes of objects, often denoted by y, by looking at their features, often denoted by capital X. This relationship is extrapolated from a set of labeled examples. Take as an example the credit scoring dataset shown here. We try to predict whether a customer will default on their loan based on their loan application. The labeled examples form a matrix with one column per feature, and one row per example. The last column represents the class.
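To make the layout concrete, here is a minimal sketch of such a labeled dataset as a pandas DataFrame. The column names and values are made up for illustration and are not the actual course dataset.

```python
import pandas as pd

# Hypothetical credit scoring data: one row per loan application,
# one column per feature, and the last column holds the class label.
credit = pd.DataFrame({
    "checking_status": ["no_account", "overdrawn", "positive", "positive"],
    "loan_amount":     [5000, 12000, 3000, 25000],
    "employment":      ["unemployed", "1_to_4_years", "over_4_years", "under_1_year"],
    "default":         [1, 0, 0, 1],   # 1 = customer defaulted, 0 = repaid
})

X = credit.drop(columns="default")  # feature matrix
y = credit["default"]               # class labels
```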

3. Feature engineering

Note that most classifiers expect numeric features. Hence, we need to convert string columns to numbers. This is an example of feature engineering, which we will meet again later in this course. We use the LabelEncoder class from the scikit-learn preprocessing module.
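As a sketch of this step, assuming the same hypothetical columns as above, each string column can be encoded to integers with LabelEncoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

credit = pd.DataFrame({
    "checking_status": ["no_account", "overdrawn", "positive", "positive"],
    "loan_amount":     [5000, 12000, 3000, 25000],
    "default":         [1, 0, 0, 1],
})

# Encode every string column to integers; numeric columns are left untouched.
for column in credit.select_dtypes(include="object"):
    credit[column] = LabelEncoder().fit_transform(credit[column])

print(credit.head())
```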

4. Model fitting

You can think of a model as an object with two methods. First, a .fit() method optimizes the model parameters using a labeled dataset. Then, the .predict() method of a fitted model predicts the labels of new examples. Take the example of the credit scoring dataset. First, split it into a matrix of features and a vector of labels. You can use the .drop() method to remove a column from a pandas DataFrame. A Gaussian naive Bayes model fitted to this data predicts 3 out of the first 5 examples correctly.
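A minimal sketch of this fit-then-predict workflow with a Gaussian naive Bayes model, again on made-up, already-encoded data:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-encoded credit data.
credit = pd.DataFrame({
    "checking_status": [0, 1, 2, 2, 0],
    "loan_amount":     [5000, 12000, 3000, 25000, 8000],
    "default":         [1, 0, 0, 1, 1],
})

X = credit.drop(columns="default")  # drop the label column to get the features
y = credit["default"]               # vector of labels

clf = GaussianNB().fit(X, y)        # .fit() optimizes the model parameters
print(clf.predict(X.iloc[:5]))      # .predict() labels the first 5 examples
```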

5. Model selection

The .fit() method does the best it can for the given class of models. But an altogether different kind of classifier might perform better still. Selecting among different classes of models is known as model selection. For example, we see here that an AdaBoost classifier has perfect accuracy on the first 5 examples.
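Swapping in a different class of model only means changing the estimator. A sketch, using a synthetic stand-in for the credit data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Compare predictions of two model classes on the first 5 examples.
for clf in (GaussianNB(), AdaBoostClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[:5]), "true:", y[:5])
```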

6. Performance assessment

The proportion of correct labels in just five examples is not very strong evidence. As with any other estimate, using more data increases our confidence. We can use the accuracy_score function from scikit-learn's metrics module. AdaBoost still comes out on top. But there is something wrong with this way of assessing performance.
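A sketch of scoring both models this way, again on synthetic stand-in data. Note that accuracy is computed here on the same data used for fitting, which is exactly the problem discussed next:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for clf in (GaussianNB(), AdaBoostClassifier()):
    y_pred = clf.fit(X, y).predict(X)
    # Accuracy on the training data itself -- see the overfitting caveat below.
    print(type(clf).__name__, accuracy_score(y, y_pred))
```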

7. Overfitting and data splitting

The problem is that assessing performance on the data used to train the classifier produces bias: the models are optimized for that particular data and will hence perform uncharacteristically well on it. This phenomenon is known as overfitting. Instead, it is better to keep some data aside as "test" data, reserved for performance assessment and model selection. First, use the train_test_split function from the model_selection module for this. Then fit the model on the training data, and use it to predict the labels of the test data.
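A minimal sketch of that split-fit-score workflow, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out a portion of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier().fit(X_train, y_train)  # fit on training data only
y_pred = clf.predict(X_test)                      # predict unseen test examples
print(accuracy_score(y_test, y_pred))             # less biased performance estimate
```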

8. Your first pipeline

So there you are: you now know how to fit a classifier to a dataset without overfitting. Load your data, engineer some features, split into training and test, and select the best-performing model!

9. So, what is this course about?

Unfortunately, real life is more complex, so we built this course to prepare you for it! We will focus on giving you four superpowers. First, how to tune your pipeline. Second, how to best incorporate domain expertise to ensure impact. Third, how to maintain good performance over time. And fourth, how to deal with few or low-quality labels. Combined, these skills will make sure you stand out from the crowd!

10. Could you have prevented the mortgage crisis?

Given what you now know, could you have prevented the sub-prime mortgage crisis? Let's find out!
