
Local validation

1. Local validation

After some preliminary steps, we come to one of the crucial parts of the solution process: local validation.

2. Motivation

Before we start, let's discuss the motivation for local validation. Recall the plot with possible overfitting to the Public test data. The problem we observe here is that we can't detect the moment when our model starts overfitting by looking only at the Public Leaderboard. That's where local validation comes into play. Using only the train data, we want to build an internal, or local, approximation of the model's performance on the Private test data.

3. Holdout set

The question is: how do we build such an approximation of the model's performance? The simplest way is to use a holdout set. We split all the train data (in other words, all the observations we know the target variable for) into train and holdout sets.

4. Holdout set

We then build a model using only the train set and make predictions on the holdout set. So, the holdout set is similar to the usual test data, but with the target variable known.
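As a rough illustration (not the course's own exercise code), here is a minimal holdout sketch using scikit-learn's train_test_split. The synthetic train DataFrame, the feature names and the choice of RandomForestRegressor are assumptions made just for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical train data; in a competition this would come from the provided train file
rng = np.random.RandomState(123)
train = pd.DataFrame(rng.rand(100, 3), columns=['feat_1', 'feat_2', 'feat_3'])
train['target'] = 2 * train['feat_1'] + rng.rand(100)

# Split all labeled observations into train and holdout parts
X = train[['feat_1', 'feat_2', 'feat_3']]
y = train['target']
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=123)

# Fit on the train part only
model = RandomForestRegressor(random_state=123)
model.fit(X_train, y_train)

# Evaluate on the holdout, where the true target is known
holdout_mse = mean_squared_error(y_hold, model.predict(X_hold))
print('Holdout MSE:', holdout_mse)
```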

5. Holdout set

It allows us to compare predictions with the actual values and gives us a fair estimate of the model's performance. However, such an approach is similar to just looking at the results on the Public Leaderboard: we always evaluate the model on the same data and could potentially overfit to it. A better idea is to use cross-validation.

6. K-fold cross-validation

The process of K-fold cross-validation is presented on the slide. We split the train data into K non-overlapping parts called 'folds' (in this case K is equal to 4).

7. K-fold cross-validation

We then train a model K times, each time on all the data except for a single fold. Each time, we also measure the quality on that held-out fold, which the model has never seen before. K-fold cross-validation lets the model train on multiple train-test splits instead of a single holdout set, which gives us a better indication of how well it will perform on unseen data.

8. K-fold cross-validation

To apply K-fold cross-validation with scikit-learn, import KFold from the model_selection module. Create a KFold object with the following parameters: n_splits is the number of folds; shuffle controls whether the data is shuffled before splitting (generally, it's better to always set this parameter to True); and random_state sets a seed to reproduce the same folds in any future run. Now, we need to train K models, one for each cross-validation split. To obtain all the splits, we call the split() method of the KFold object with the train data as an argument. It yields the training and testing observations for each split, given as numeric indices into the train data. These indices can be used inside a loop to select the training and testing folds for the corresponding cross-validation split. For a pandas DataFrame, this can be done with the .iloc indexer, for example.
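The description above might look like the following sketch. The toy train DataFrame and its column names are assumptions made for illustration, but KFold, split() and .iloc are used as described.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical train DataFrame standing in for the competition's train data
rng = np.random.RandomState(123)
train = pd.DataFrame(rng.rand(20, 2), columns=['feat_1', 'feat_2'])
train['target'] = rng.rand(20)

# n_splits: number of folds; shuffle: shuffle rows before splitting; random_state: reproducible folds
kf = KFold(n_splits=4, shuffle=True, random_state=123)

# split() yields numeric indices of the training and testing observations for each fold
for fold, (train_index, test_index) in enumerate(kf.split(train)):
    # Select the corresponding rows of the DataFrame with .iloc
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold {}: {} train rows, {} test rows'.format(fold, len(cv_train), len(cv_test)))
```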

9. Stratified K-fold

Another approach to cross-validation is stratified K-fold. It is the same as the usual K-fold, but it creates folds stratified by the target variable: the folds are made by preserving the percentage of samples from each class of that variable. As we see in the image, each fold has the same class distribution as the initial data. This is useful when we have a classification problem with high class imbalance in the target variable, or when the data size is very small.

10. Stratified K-fold

StratifiedKFold is also located in sklearn's model_selection module. It has the same parameters as the usual KFold: n_splits, shuffle and random_state. The only difference is that, on top of the train data, we should also pass the target variable into the split() call in order to perform the stratification.
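A comparable sketch for StratifiedKFold, again with an assumed toy DataFrame and an artificially imbalanced target, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical, artificially imbalanced classification data
rng = np.random.RandomState(123)
train = pd.DataFrame(rng.rand(80, 2), columns=['feat_1', 'feat_2'])
train['target'] = [0] * 60 + [1] * 20   # 75% class 0, 25% class 1

# Same parameters as the usual KFold
str_kf = StratifiedKFold(n_splits=4, shuffle=True, random_state=123)

# On top of the train data, pass the target so each fold keeps the class proportions
for fold, (train_index, test_index) in enumerate(str_kf.split(train, train['target'])):
    fold_share = train.iloc[test_index]['target'].mean()
    print('Fold {}: share of class 1 = {:.2f}'.format(fold, fold_share))
```

Each fold should report roughly the same share of class 1 as the full data, which is exactly the stratification property described above.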

11. Let's practice!

As you can see, there are various validation strategies available. Let's try them out!
