
Supervised learning pipelines

1. Supervised learning pipelines

Hi and welcome! This course is designed to give you four machine learning superpowers. To explain what these are, we first need to make sure you are up to speed with the basics. So, get ready for a quick review of machine learning pipelines!

2. Labeled data

We will mostly talk about classification in this course. Classifiers try to identify the classes of objects, often denoted by y, by looking at their features, often denoted by capital X. This relationship is extrapolated from a set of labeled examples. Take as an example the credit scoring dataset shown here. We try to predict whether a customer will default on their loan based on their loan application. The labeled examples form a matrix with one column per feature, and one row per example. The last column represents the class.
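To make the layout concrete, here is a minimal sketch of such a labeled dataset as a pandas DataFrame. The column names and values are made up for illustration and are not the actual course dataset.

```python
import pandas as pd

# Hypothetical credit scoring data: one row per loan application,
# one column per feature, and the last column holds the class label.
credit = pd.DataFrame({
    "checking_status": ["no_account", "overdrawn", "positive", "positive"],
    "loan_amount":     [5000, 12000, 3000, 25000],
    "employment":      ["unemployed", "1_to_4_years", "over_4_years", "under_1_year"],
    "default":         [1, 0, 0, 1],   # 1 = customer defaulted, 0 = repaid
})

X = credit.drop(columns="default")  # feature matrix
y = credit["default"]               # class labels
```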

3. Feature engineering

Note that most classifiers expect numeric features. Hence, we need to convert string columns to numbers. This is an example of feature engineering, which we will meet again later in this course. We use the LabelEncoder class from the scikit-learn preprocessing module.
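As a sketch of this step, assuming the same hypothetical columns as above, each string column can be encoded to integers with LabelEncoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

credit = pd.DataFrame({
    "checking_status": ["no_account", "overdrawn", "positive", "positive"],
    "loan_amount":     [5000, 12000, 3000, 25000],
    "default":         [1, 0, 0, 1],
})

# Encode every string column to integers; numeric columns are left untouched.
for column in credit.select_dtypes(include="object"):
    credit[column] = LabelEncoder().fit_transform(credit[column])

print(credit.head())
```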

4. Model fitting

You can think of a model as an object with two methods. First, a .fit() method optimizes the model parameters using a labeled dataset. Then, the .predict() method of a fitted model predicts the labels of new examples. Take the example of the credit scoring dataset. First, split it into a matrix of features and a vector of labels. You can use the .drop() method to remove a column from a pandas DataFrame. A Gaussian naive Bayes model fitted to this data predicts 3 out of the first 5 examples correctly.
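A minimal sketch of this fit-then-predict workflow with a Gaussian naive Bayes model, again on made-up, already-encoded data:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-encoded credit data.
credit = pd.DataFrame({
    "checking_status": [0, 1, 2, 2, 0],
    "loan_amount":     [5000, 12000, 3000, 25000, 8000],
    "default":         [1, 0, 0, 1, 1],
})

X = credit.drop(columns="default")  # drop the label column to get the features
y = credit["default"]               # vector of labels

clf = GaussianNB().fit(X, y)        # .fit() optimizes the model parameters
print(clf.predict(X.iloc[:5]))      # .predict() labels the first 5 examples
```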

5. Model selection

The .fit() method does the best it can for the given class of models. But an altogether different kind of classifier might perform better still. Selecting among different classes of models is known as model selection. For example, we see here that an AdaBoost classifier has perfect accuracy on the first 5 examples.
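Swapping in a different class of model only means changing the estimator. A sketch, using a synthetic stand-in for the credit data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Compare predictions of two model classes on the first 5 examples.
for clf in (GaussianNB(), AdaBoostClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[:5]), "true:", y[:5])
```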

6. Performance assessment

The proportion of correct labels in just five examples is not very strong evidence. As with any other estimate, using more data increases our confidence. We can use the accuracy_score function from scikit-learn's metrics module. AdaBoost still comes out on top. But there is something wrong with this way of assessing performance.
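A sketch of scoring both models this way, again on synthetic stand-in data. Note that accuracy is computed here on the same data used for fitting, which is exactly the problem discussed next:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for clf in (GaussianNB(), AdaBoostClassifier()):
    y_pred = clf.fit(X, y).predict(X)
    # Accuracy on the training data itself -- see the overfitting caveat below.
    print(type(clf).__name__, accuracy_score(y, y_pred))
```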

7. Overfitting and data splitting

The problem is that assessing performance on the data used to train the classifier produces bias: the models are optimized for that particular data and will hence perform uncharacteristically well on it. This phenomenon is known as overfitting. Instead, it is better to keep some data aside as "test" data, reserved for performance assessment and model selection. First, use the train_test_split function from the model_selection module for this. Then fit the model on the training data, and use it to predict the labels of the test data.
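A minimal sketch of that split-fit-score workflow, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out a portion of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = AdaBoostClassifier().fit(X_train, y_train)  # fit on training data only
y_pred = clf.predict(X_test)                      # predict unseen test examples
print(accuracy_score(y_test, y_pred))             # less biased performance estimate
```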

8. Your first pipeline

So there you are: you now know how to fit a classifier to a dataset without overfitting. Load your data, engineer some features, split into training and test, and select the best-performing model!

9. So, what is this course about?

Unfortunately, real life is more complex, so we built this course to prepare you for it! We will focus on giving you four superpowers. First, how to tune your pipeline. Second, how to best incorporate domain expertise to ensure impact. Third, how to maintain good performance over time. And fourth, how to deal with few or low-quality labels. Combined, these skills will make sure you stand out from the crowd!

10. Could you have prevented the mortgage crisis?

Given what you now know, could you have prevented the sub-prime mortgage crisis? Let's find out!
