1. The problems with holdout sets
Hello again - let's continue our quest to validate machine learning models by discussing why traditional validation approaches still have pitfalls.
2. Traditional validation
The typical modeling procedure looks something like this. We take a dataset, use, say, 80% for training, and the remaining 20% for testing.
We learned how to do this a couple of lessons ago using scikit-learn. Using the train_test_split() function, we split our data and run a random forest regression model on this single split. Here we have printed the MAE, or mean absolute error, which was 10-point-24.
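A minimal sketch of this single-split workflow is shown below. It assumes the candy data lives in a pandas DataFrame called candy with a winpercent target column and a competitorname identifier; the column names and the seed are assumptions, not the exact course code.

```python
# Minimal single-split workflow: one train/test split, one random forest,
# one MAE score. Column names and the random seed are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

X = candy.drop(columns=["competitorname", "winpercent"])  # assumed column names
y = candy["winpercent"]

# Single 80/20 split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

rfr = RandomForestRegressor(random_state=1111)
rfr.fit(X_train, y_train)

print(mean_absolute_error(y_test, rfr.predict(X_test)))
```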
3. Traditional training splits
If we repeat this process with a different random seed though, we might get different results. Consider the following two samples from the ultimate candy-power-ranking dataset: s1 and s2. This dataset consists of 85 data points about candy characteristics, and we have randomly selected 60 candies for each sample.
Only 39 of the 60 candies overlap between the two datasets.
4. Traditional training splits
Furthermore, the first sample contains 34 chocolate candies, and the second sample only contains 30.
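To see this variability yourself, you could draw the two samples with pandas' .sample() method, as in the sketch below; the seeds are arbitrary, so the exact overlap and chocolate counts you get will differ from the 39 and 34/30 quoted above.

```python
# Draw two random 60-candy samples and compare them.
# Seeds are arbitrary; your counts will vary.
s1 = candy.sample(60, random_state=1111)
s2 = candy.sample(60, random_state=1112)

# Number of candies that appear in both samples
print(len(set(s1.index) & set(s2.index)))

# Number of chocolate candies in each sample (assumes a binary `chocolate` column)
print(s1["chocolate"].sum(), s2["chocolate"].sum())
```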
5. The split matters
Why is this important? Well, we have already seen that which 60 candies end up in a sample can vary quite a bit. If we split the candy dataset into 60 candies for training and 25 candies for testing, and build the exact same machine learning model, we will likely get noticeably different results.
In this example alone, the second testing error is over 12% higher. Using the first sample, you would report an error of 10-point-32. The second gives an error of 11-point-56. These results are way too different.
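One way to see this is to re-split and re-fit the exact same model under two different seeds, reusing the X and y defined in the earlier sketch; the seeds and model settings here are illustrative assumptions, so the errors you see will not match 10-point-32 and 11-point-56 exactly.

```python
# Fit the exact same model on two different 60/25 splits and compare errors.
# X and y come from the earlier sketch; seeds and settings are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

for seed in [1111, 2222]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=60, test_size=25, random_state=seed)
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)  # identical model each time
    rfr.fit(X_train, y_train)
    print("Split seed {}: MAE = {:.2f}".format(
        seed, mean_absolute_error(y_test, rfr.predict(X_test))))
```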
6. Train, validation, test
Even the train, validation, test procedure we discussed earlier is not safe from the problems we could have with holdout samples, especially when we have limited data.
Consider this example. We created a train, test, and validation split. We fit a random forest model, and maybe we even did some hyperparameter tuning or testing of various models. In the end, we decided on this random forest regressor model.
Look at how close the validation and testing errors are to each other - 9-point-18 and 8-point-98. This is awesome, right?
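A train, validation, test split like the one described could be built with two calls to train_test_split(), as sketched below; the splitting proportions, seed, and model parameters are assumptions used only to illustrate the pattern.

```python
# One possible train/validation/test workflow: carve off the test set first,
# then split the remainder into training and validation sets.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

rfr = RandomForestRegressor(n_estimators=25, max_features=4, random_state=1111)
rfr.fit(X_train, y_train)

print(mean_absolute_error(y_val, rfr.predict(X_val)))    # validation error
print(mean_absolute_error(y_test, rfr.predict(X_test)))  # testing error
```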
7. Round 2
Let's run the same model again, but this time we will run it with a different random seed. The errors were 8-point-73 and 10-point-91, which is a big problem. This can happen when using the traditional validation approach, especially with limited data. We think our model is validated, but if we just change the sample we used, we get drastically different results.
This random forest model with only 25 trees and 4 features does not seem to generalize as well to new data as we would expect.
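To reproduce this kind of swing, you could loop the same splitting-and-fitting procedure over a couple of split seeds while holding the model fixed, reusing the imports and the X and y from the earlier sketches; again the seeds are arbitrary, so your exact errors will differ from the ones quoted above.

```python
# Repeat the train/validation/test procedure with different split seeds while
# keeping the model itself fixed, to see how much the reported errors move.
for seed in [1111, 2222]:
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed)

    rfr = RandomForestRegressor(n_estimators=25, max_features=4,
                                random_state=1111)
    rfr.fit(X_train, y_train)
    print("Split seed {}: validation MAE = {:.2f}, testing MAE = {:.2f}".format(
        seed,
        mean_absolute_error(y_val, rfr.predict(X_val)),
        mean_absolute_error(y_test, rfr.predict(X_test))))
```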
8. Holdout set exercises
To overcome this limitation of holdout sets, we use something called cross-validation, which is the gold standard for model validation! Before we fully introduce cross-validation, let's discover why we need it with a couple of exercises.