
Cross-validation

1. Cross-validation

Hello everyone - let's push validation a step further and discuss the gold standard: cross-validation.

2. Cross-validation

Before, we talked about using 80% of our data for training and 20% for testing. We took this a step further by splitting the 80% of training data into training and validation splits. Previously, we learned that our accuracy metric on this validation set may be misleading: if we split the data differently, we might get different results.

3. Cross-validation

For cross-validation, we don't need just one of these training/validation splits; we need several. This method has us run the same model on various training/validation combinations and gives us a lot more confidence in our final metrics. For this example, we have a 5-fold cross-validation. Each time we run the model, a different 80% of the data will be used for training, and a different 20% will be used for validation. We can also arrange the splits so that every point appears in exactly one validation set. Although this is not strictly required for cross-validation, it is often good practice. And fortunately for us, understanding what this should look like, how it can be done, and why it even matters is the hardest part. Actually implementing it is very straightforward.
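To make the fold layout concrete, here is a minimal sketch. The 40-point dataset and the use of numpy's array_split are purely illustrative; they are not part of the scikit-learn workflow shown next.

import numpy as np

# 40 data points split into 5 folds of 8 indices each;
# each fold serves as the validation set exactly once,
# and the other 32 points form that round's training set
indices = np.arange(40)
folds = np.array_split(indices, 5)

for i, val_idx in enumerate(folds):
    train_idx = np.setdiff1d(indices, val_idx)
    print(f"Fold {i}: train on {len(train_idx)} points, validate on {len(val_idx)} points")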

4. KFold cross-validation with scikit-learn

scikit-learn's KFold() gives us a few options for splitting data into several training and validation sets. We can specify the number of splits we want, whether the data should be shuffled, and, to make our results reproducible, a random state. Here I have generated two arrays to use as data. The X array consists of the numbers 0 through 39, and the y array consists of 20 zeros followed by 20 ones. Next, we create the generator kf, which will split our data. It uses KFold() with five splits and no shuffling. To actually split our data, we call kf-dot-split() on X. This only generates indices for us to use, so I don't want you to think that we have generated five training and validation datasets. All we have done is create a list of indices that can be used for our splits.
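Put together, the setup described on this slide might look something like the following sketch; the variable names X, y, kf, and splits come from the transcript, and the rest is standard scikit-learn usage.

import numpy as np
from sklearn.model_selection import KFold

# X holds the numbers 0 through 39; y holds 20 zeros followed by 20 ones
X = np.arange(40)
y = np.array([0] * 20 + [1] * 20)

# Create the generator: five splits, no shuffling
# (a random_state only has an effect when shuffle=True)
kf = KFold(n_splits=5, shuffle=False)

# This does not build five datasets; it only yields pairs of
# (train_index, test_index) arrays we can use to slice X and y
splits = kf.split(X)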

5. Accessing indices

So what's actually in splits if it doesn't contain datasets? The splits variable contains the training and validation indices for the five different splits of X. If we print the length of the indices, we see that train_index has 32 values and test_index has eight values, and this is repeated five times. If we print out what these lists actually look like for one of the splits, we see that train_index has the numbers 0 through 31 and test_index has the numbers 32 through 39. Indexing X and y with these arrays gives us our training and validation data.
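Here is a sketch of inspecting those indices, continuing from the X, y, and kf defined above. The exact printout depends on which split you look at; the values shown correspond to the last of the five splits.

# kf.split(X) returns a generator, so call it again each time we loop
for train_index, test_index in kf.split(X):
    print(len(train_index), len(test_index))   # prints "32 8" five times

# Look at one split in detail (here, the last one)
train_index, test_index = list(kf.split(X))[-1]
print(train_index)   # indices 0 through 31
print(test_index)    # indices 32 through 39

# Indexing X and y with these arrays gives the actual training
# and validation data
X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]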

6. Example using splits

KFold is generally used when we want to fit and evaluate the same model on each of the cross-validation splits. We create the splits using kf-dot-split(), then loop through the train and validation indices and fit the same model on each new training set. Finally, we create the predictions and keep track of the errors. To see how well the model performed across the five splits that we created, we can look at the mean of the final error scores.
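A sketch of that workflow follows, reusing the toy X and y from before (reshaped into a single feature column so it works as a feature matrix). The transcript does not name a specific model or error metric, so the random forest regressor and mean squared error here are assumptions chosen purely for illustration.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Toy data from the earlier slide, reshaped to a single feature column
X = np.arange(40).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 20)

kf = KFold(n_splits=5, shuffle=False)
errors = []

# Loop through the train and validation indices
for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # Fit the same model on each new training set
    model = RandomForestRegressor(n_estimators=25, random_state=1111)
    model.fit(X_train, y_train)

    # Create predictions and keep track of the error
    predictions = model.predict(X_val)
    errors.append(mean_squared_error(y_val, predictions))

# The mean of the error scores summarizes performance across the five splits
print(np.mean(errors))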

7. Practice time

Let's get started and fold some data!