
Estimating performance with cross validation

1. Estimating performance with cross validation

In this section, we will learn how to improve the model evaluation process with a method known as cross validation.

2. Training and test datasets

We have been creating training and test datasets in our modeling process: the training data is used for model fitting, while the test data is reserved for model evaluation to guard against overfitting. One downside of this approach is that it provides only a single estimate of model performance.
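As a reminder of that workflow, here is a minimal sketch using rsample, assuming the raw data lives in a leads_df tibble with a purchased outcome column (the data frame name is an assumption; leads_training matches the object used later in this lesson):

```r
library(tidymodels)

# Set the seed so the random split is reproducible
set.seed(214)

# Split the data into training and test sets, stratified by the outcome
leads_split <- initial_split(leads_df, prop = 0.75, strata = purchased)

leads_training <- training(leads_split)
leads_test     <- testing(leads_split)
```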

3. K-fold cross validation

K-fold cross validation is a technique that provides K estimates of model performance and is typically used to compare different model types, such as logistic regression and decision trees.

4. K-fold cross validation

The training data is randomly partitioned into K sets of roughly equal size, known as folds, which are used to perform K iterations of model fitting and evaluation. The test dataset is left out of this process so it can provide a final, independent estimate of model performance once a model type is chosen.

5. Machine learning with cross validation

If we have five folds, we will have five iterations of model training and evaluation.

6. Machine learning with cross validation

In the first iteration, fold 1 is reserved for model evaluation while the others are used for model training.

7. Machine learning with cross validation

In the second iteration, fold 2 is reserved for model evaluation while the others are used for model training.

8. Machine learning with cross validation

This process continues until the fifth iteration, where fold 5 is used for model evaluation. In total, this provides five estimates of model performance.

9. Creating cross validation folds

The vfold_cv() function creates cross validation folds. It takes a tibble as its first argument, the number of folds, v, and an optional stratification variable, strata. To create 10 folds from our leads_training data, we set v equal to 10 and stratify by purchased to ensure each fold has similar proportions of the outcome values. Executing the set.seed() function before vfold_cv() ensures reproducibility; it takes any integer as an argument and sets the seed of R's random number generator. The result is a tibble with a list column named splits and an id column that identifies each fold. Each row of splits contains a data split object with the instructions for dividing that fold's data into training and evaluation sets.
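A sketch of this step, assuming the leads_training tibble created earlier in the lesson:

```r
library(tidymodels)

# Set the seed of R's random number generator for reproducibility
set.seed(214)

# Create 10 cross validation folds, stratified by the outcome variable
leads_folds <- vfold_cv(
  leads_training,
  v = 10,
  strata = purchased
)

# A tibble with a `splits` list column and an `id` column, one row per fold
leads_folds
```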

10. Model training with cross validation

The fit_resamples() function performs cross validation in tidymodels. To train our leads_workflow on each fold, we pass it to fit_resamples(), provide leads_folds to the resamples argument, and pass our custom metric set to the optional metrics argument. By default, accuracy and ROC AUC are calculated. This returns a resamples object on which we can collect metrics. We see that each metric was estimated 10 times, once per fold, and the average of these estimates is provided in the mean column.
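A minimal sketch of this step, assuming leads_workflow was created earlier in the lesson; the three metrics chosen for the metric set are an assumption for illustration:

```r
# Custom metric set (illustrative choice of three metrics)
leads_metrics <- metric_set(roc_auc, sens, spec)

# Fit the workflow to each of the 10 cross validation folds
leads_rs_fit <- fit_resamples(
  leads_workflow,
  resamples = leads_folds,
  metrics = leads_metrics
)

# Summarized performance: one row per metric, averaged across the folds
collect_metrics(leads_rs_fit)
```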

11. Detailed cross validation results

Passing summarize = FALSE to collect_metrics() creates a tibble with detailed results. For our leads_rs_fit, this gives us 30 total rows, which represent our 3 metrics times our 10 folds. The .metric column identifies the metric, while the .estimate column provides the estimated value for each fold.
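Continuing the sketch above:

```r
# One row per metric per fold (3 metrics x 10 folds = 30 rows)
rs_metrics <- collect_metrics(leads_rs_fit, summarize = FALSE)

rs_metrics
```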

12. Summarizing cross validation results

The results of collect_metrics() can be summarized with dplyr. Starting with rs_metrics, we group by the .metric column, then calculate summary statistics for each metric with the summarize() function. This provides a summary of the distribution of estimated metric values in our cross validation process.
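A sketch of this summary, assuming the rs_metrics tibble from the previous step; the particular summary statistics shown are illustrative:

```r
library(dplyr)

# Summarize the distribution of fold-level estimates for each metric
rs_metrics %>%
  group_by(.metric) %>%
  summarize(
    min    = min(.estimate),
    median = median(.estimate),
    max    = max(.estimate),
    mean   = mean(.estimate),
    sd     = sd(.estimate)
  )
```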

13. Cross validation methodology

Resample model objects are not able to provide predictions on new data sources. Passing leads_rs_fit to predict() yields an error. The purpose of cross validation in tidymodels is not to fit a final model, but to compare the performance of different model types to discover which one works best for our data.
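For example (sketch, with leads_test standing in for any new data):

```r
# Resample results store performance estimates, not a fitted final model,
# so they cannot generate predictions on new data. The call below errors:
# predict(leads_rs_fit, new_data = leads_test)
```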

14. Let's cross validate!

Let's put our cross validation skills to use!
