
Cross-validation

1. Cross-validation

Great work on those regression challenges! Hopefully we are now feeling familiar with train/test splits and with computing model performance metrics on the test set. However, there is a potential pitfall in this process.

2. Cross-validation motivation

If we're computing R-squared on our test set, the R-squared returned is dependent on the way that we split up the data! The data points in the test set may have some peculiarities that mean the R-squared computed on it is not representative of the model's ability to generalize to unseen data. To combat this dependence on what is essentially a random split, we use a technique called cross-validation.

3. Cross-validation basics

We begin by splitting the dataset into five groups or folds.

4. Cross-validation basics

Then we set aside the first fold as a test set,

5. Cross-validation basics

fit our model on the remaining four folds, predict on our test set,

6. Cross-validation basics

and compute the metric of interest, such as R-squared.

7. Cross-validation basics

Next, we set aside the second fold as our test set,

8. Cross-validation basics

fit on the remaining data, predict on the test set,

9. Cross-validation basics

and compute the metric of interest.

10. Cross-validation basics

Then similarly with the third fold,

11. Cross-validation basics

the fourth fold,

12. Cross-validation basics

and the fifth fold. As a result, we get five values of R-squared, from which we can compute statistics of interest, such as the mean, median, and a 95% confidence interval.
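The loop described on these slides can be sketched directly with KFold from scikit-learn. This is a minimal illustration rather than the course code: the synthetic dataset from make_regression and the seed of 42 are assumptions standing in for the course data.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data standing in for the course dataset
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, test_idx in kf.split(X):
    # Fit on the four training folds, then predict and score on the held-out fold
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])
    y_pred = reg.predict(X[test_idx])
    scores.append(r2_score(y[test_idx], y_pred))

# Five R-squared values, one per fold
print(np.mean(scores), np.median(scores))
print(np.quantile(scores, [0.025, 0.975]))  # 95% confidence interval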

13. Cross-validation and model performance

As we split the dataset into five folds, we call this process 5-fold cross-validation. If we use 10 folds, it is called 10-fold cross-validation. More generally, if we use k folds, it is called k-fold cross-validation or k-fold CV. There is, however, a trade-off. Using more folds is more computationally expensive. This is because we are fitting and predicting more times.

14. Cross-validation in scikit-learn

To perform k-fold cross-validation in scikit-learn, we import cross_val_score from sklearn-dot-model_selection. We also import KFold, which allows us to shuffle our data and set a seed, making our results repeatable downstream. We first call KFold. The n_splits argument has a default of five, but in this case we assign six, so that we use six folds from our dataset for cross-validation. We also set shuffle to True, which shuffles our dataset before splitting it into folds, and we assign a seed to the random_state keyword argument, ensuring the data is split in the same way if we repeat the process. We save this as the variable kf. As usual, we instantiate our model, in this case, linear regression. We then call cross_val_score, passing the model, the feature data, and the target data as the first three positional arguments. We also specify how the data should be split by setting the keyword argument cv equal to our kf variable. This returns an array of cross-validation scores, which we assign to cv_results. The length of the array is the number of folds used. Note that the score reported is R-squared, as this is the default score for linear regression.
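As a sketch, the steps described above might look like the following; the synthetic data from make_regression stands in for the course's feature and target arrays, and the seed of 42 is an assumption.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data standing in for the course's feature and target arrays
X, y = make_regression(n_samples=120, n_features=3, noise=10, random_state=42)

# Six folds, shuffled before splitting, with a seed so the splits are repeatable
kf = KFold(n_splits=6, shuffle=True, random_state=42)

reg = LinearRegression()

# One fit/predict/score cycle per fold; R-squared is the default score for linear regression
cv_results = cross_val_score(reg, X, y, cv=kf)
print(cv_results)  # array of six R-squared scores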

15. Evaluating cross-validation performance

We can now print the scores. This returns six results ranging from zero-point-seven to approximately zero-point-seven-seven. We can calculate the mean score using np-dot-mean, and the standard deviation using np-dot-std. Additionally, we can calculate the 95% confidence interval using the np-dot-quantile function, passing our results followed by a list containing the lower and upper limits of our interval as decimals.
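Continuing from the cv_results array in the previous sketch, the summary statistics might be computed like this:

import numpy as np

# Mean and standard deviation of the six cross-validation scores
print(np.mean(cv_results), np.std(cv_results))

# 95% confidence interval: the 2.5th and 97.5th percentiles of the scores
print(np.quantile(cv_results, [0.025, 0.975]))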

16. Let's practice!

Now let's apply k-fold cross-validation on our sales dataset!