Iterating without overfitting
1. Iterating without overfitting
If you thought that after you deploy your model, your work is done, think again! There will always be opportunities to improve your model further.
2. Why iterate?
Indeed, in what is known as "agile software development", you will be asked to deploy the simplest possible model as soon as you can. Then you can choose to tune it further while users start giving feedback.
3. Why iterate?
You might also want to retrain your model on fresh data coming in from production.
4. Why iterate?
Or you might get other types of feedback, like ideas for new features or loss functions.
5. Champion-challenger
In all cases, you end up with one model in production, which is often referred to as the current "champion", and one in development, the "challenger", which might eventually replace the champion.
6. Cross-validation results
However, recall that working on the same data for too long risks overfitting. You can dig deeper into cross-validation to detect this. Recall that cross-validation splits your data into training and test internally several times - in this example, three times, as controlled by the parameter cv. Set return_train_score to True when you create the grid search object, then cast the cv_results_ attribute of the fitted object into a DataFrame. This yields a number of metrics, one per column, for each parameter combination, one per row. Focus on the mean training and test scores and their standard deviations, shown here.
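As a rough sketch of this in scikit-learn (the classifier, parameter grid, and synthetic data below are illustrative, not the course's own):

```python
# Sketch: grid search that also records training-fold scores.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, random_state=0)  # placeholder data

param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                     # three internal train/test splits
    return_train_score=True,  # also record scores on the training folds
)
grid.fit(X_train, y_train)

# One row per parameter combination, one column per metric
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_train_score", "std_train_score",
               "mean_test_score", "std_test_score"]])
```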
7. Cross-validation results
The first thing to note is whether the training score is a lot higher than the test score. Although we generally expect the training score to be somewhat higher, very large differences might indicate overfitting and a need for more training data. Another warning sign is when the standard deviation of the test score, especially that of the winning row, is very large. Excess variation of the test score across different ways of splitting the data indicates too much sensitivity to the training data, which is a sign of overfitting.
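Continuing from the results DataFrame in the sketch above, these warning signs could be checked along these lines (the 0.05 threshold is an arbitrary illustration, not a rule):

```python
# Gap between mean training and mean test score, per parameter combination
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

# Inspect the winning row in particular
best = results.loc[grid.best_index_]
print("Winning row train/test gap:", best["mean_train_score"] - best["mean_test_score"])
print("Winning row test-score std:", best["std_test_score"])

# Large gaps, or a large std_test_score on the winning row, hint at overfitting
suspicious = results[(results["gap"] > 0.05) | (results["std_test_score"] > 0.05)]
print(suspicious[["params", "gap", "std_test_score"]])
```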
8. Data splitting review
So let us step back and review our data splitting strategy. We start by splitting into training and test, and then, via cross-validation, we effectively split our training set further into a chunk used for model fitting and one used for model tuning and selection.
9. Data splitting review
It is confusing to call two chunks of data "test", so let us rename the initial test split as "validation". The validation set gives us an unbiased estimate of accuracy.
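One way to set up this splitting strategy in scikit-learn, sketched with the same illustrative estimator and synthetic data as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Initial split: the held-out chunk is our "validation" set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation then splits X_train further into fitting and tuning chunks
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"max_depth": [3, 5, 10]},
                    cv=3, return_train_score=True)
grid.fit(X_train, y_train)

# The validation set gives an unbiased estimate of the selected model's accuracy
print("Validation accuracy:", grid.score(X_val, y_val))
```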
10. Detecting overfitting
So there are a few possibilities to think about. If the CV train score is much larger than the CV test score, this is an indication of overfitting in the model fitting stage, so you could reduce the classifier complexity or add more training data. Remember that 10-fold CV assigns nine tenths of the data to model fitting, whereas 3-fold only assigns two thirds, so increasing the number of CV folds can also help. Analogously, if the CV test score is much higher than the validation score, this suggests that you have overfitted during the tuning and selection step; reducing the size of the parameter grid you are searching over can help.
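Continuing the sketch above, the three scores can be put side by side; what counts as "much larger" is a judgment call, so the 0.05 thresholds below are purely illustrative:

```python
cv_train = grid.cv_results_["mean_train_score"][grid.best_index_]
cv_test = grid.best_score_              # mean CV test score of the winning combination
val_score = grid.score(X_val, y_val)    # score on the held-out validation set

print(f"CV train: {cv_train:.3f}  CV test: {cv_test:.3f}  validation: {val_score:.3f}")

if cv_train - cv_test > 0.05:
    print("Possible overfitting while fitting: simplify the classifier, "
          "add training data, or increase the number of CV folds.")
if cv_test - val_score > 0.05:
    print("Possible overfitting while tuning: shrink the parameter grid.")
```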
11. Fresh data
So far we have focused on further development using the same data.
12. Fresh data
But what about incorporating fresh production data? Well, you could just add it to the mix, but you might find your performance is quite different on fresh data than on your training set. This is known as dataset shift, and it is the subject of the next video exercise!
13. "Expert in CV" in your CV!
For now, it is time for you to dig deeper into cross-validation. Becoming an expert in cross-validation, also known as CV, is a great skill to have on your CV!