Iterating without overfitting
1. Iterating without overfitting
If you thought that after you deploy your model, your work is done, think again! There will always be opportunities to improve your model further.
2. Why iterate?
Indeed, in what is known as "agile software development", you will be asked to deploy the simplest possible model as soon as you can. Then you can choose to tune it further while users start giving feedback.
3. Why iterate?
You might also want to retrain your model on fresh data coming in from production.
4. Why iterate?
Or you might get other types of feedback, like ideas for new features or loss functions.
5. Champion-challenger
In all cases, you end up with one model in production, which is often referred to as the current "champion", and one in development, the "challenger", which might eventually replace the champion.
6. Cross-validation results
However, recall that working on the same data for too long risks overfitting. You can dig deeper into cross-validation to detect this. Recall that cross-validation splits your data into training and test internally several times - in this example, three times, as controlled by the parameter cv. Set return_train_score to True when you create the grid search object, then cast the cv_results_ attribute of the fitted object into a DataFrame. This yields a number of metrics, one per column, for each parameter combination, one per row. Focus on the mean training and test scores and their standard deviations, shown here.
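As a rough sketch of this in scikit-learn (the classifier, parameter grid, and synthetic data below are illustrative, not the course's own):

```python
# Sketch: grid search that also records training-fold scores.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, random_state=0)  # placeholder data

param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                     # three internal train/test splits
    return_train_score=True,  # also record scores on the training folds
)
grid.fit(X_train, y_train)

# One row per parameter combination, one column per metric
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_train_score", "std_train_score",
               "mean_test_score", "std_test_score"]])
```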
7. Cross-validation results
The first thing to note is whether the training score is a lot higher than the test score. Although we generally expect the training score to be somewhat higher, very large differences might indicate overfitting and a need for more training data. Another warning sign is when the standard deviation of the test score, especially that of the winning row, is very large. Excess variation of the test score across different ways of splitting the data indicates too much sensitivity to the training data, which is a sign of overfitting.
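Continuing from the results DataFrame in the sketch above, these warning signs could be checked along these lines (the 0.05 threshold is an arbitrary illustration, not a rule):

```python
# Gap between mean training and mean test score, per parameter combination
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

# Inspect the winning row in particular
best = results.loc[grid.best_index_]
print("Winning row train/test gap:", best["mean_train_score"] - best["mean_test_score"])
print("Winning row test-score std:", best["std_test_score"])

# Large gaps, or a large std_test_score on the winning row, hint at overfitting
suspicious = results[(results["gap"] > 0.05) | (results["std_test_score"] > 0.05)]
print(suspicious[["params", "gap", "std_test_score"]])
```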
8. Data splitting review
So let us step back and review our data splitting strategy. We start by splitting into training and test, and then, via cross-validation, we effectively split our training set further into a chunk used for model fitting and one used for model tuning and selection.
9. Data splitting review
It is confusing to call two chunks of data "test", so let us rename the initial test split as "validation". The validation set gives us an unbiased estimate of accuracy.
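One way to set up this splitting strategy in scikit-learn, sketched with the same illustrative estimator and synthetic data as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Initial split: the held-out chunk is our "validation" set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation then splits X_train further into fitting and tuning chunks
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"max_depth": [3, 5, 10]},
                    cv=3, return_train_score=True)
grid.fit(X_train, y_train)

# The validation set gives an unbiased estimate of the selected model's accuracy
print("Validation accuracy:", grid.score(X_val, y_val))
```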
10. Detecting overfitting
So there are a few possibilities to think about. If the CV train score is much larger than the CV test score, this is an indication of overfitting in the model fitting stage, so you could reduce the classifier complexity or add more training data. Remember that 10-fold CV assigns nine tenths of the data to model fitting, whereas 3-fold only assigns two thirds, so increasing the number of CV folds can also help. Analogously, if the CV test score is much higher than the validation score, this suggests that you have overfitted during the tuning and selection step; reducing the size of the parameter grid you are searching over can help.
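Continuing the sketch above, the three scores can be put side by side; what counts as "much larger" is a judgment call, so the 0.05 thresholds below are purely illustrative:

```python
cv_train = grid.cv_results_["mean_train_score"][grid.best_index_]
cv_test = grid.best_score_              # mean CV test score of the winning combination
val_score = grid.score(X_val, y_val)    # score on the held-out validation set

print(f"CV train: {cv_train:.3f}  CV test: {cv_test:.3f}  validation: {val_score:.3f}")

if cv_train - cv_test > 0.05:
    print("Possible overfitting while fitting: simplify the classifier, "
          "add training data, or increase the number of CV folds.")
if cv_test - val_score > 0.05:
    print("Possible overfitting while tuning: shrink the parameter grid.")
```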
11. Fresh data
So far we have focused on further development using the same data.
12. Fresh data
But what about incorporating fresh production data? Well, you could just add it to the mix, but you might find your performance is quite different on fresh data than on your training set. This is known as dataset shift, and it is the subject of the next video exercise!
13. "Expert in CV" in your CV!
For now, it is time for you to dig deeper into cross-validation. Becoming an expert in cross-validation, also known as CV, is a great skill to have on your CV!