1. Training, test and validation splits
Two of the most important questions that a data scientist must answer when building machine learning models are:
How well would my model perform on new data?
and
Did I select the best performing model?
Throughout this chapter you will learn the techniques necessary to answer these questions.
2. Train-Test Split
To answer the first question,
"How well would my model perform on new data?"
you start with all of your data, which contains both the features and the outcome you want to predict,
3. Train-Test Split
and split it into two portions.
4. Train-Test Split
The first portion is used to train a model and the second portion is used to test how well it performs on new data.
This is known as the train-test split.
In a disciplined machine learning workflow, this is a critical first step.
As long as the test data is a fair representation of the data you expect to see in the future, you can use it to estimate the expected performance on future observations.
5. initial_split()
To make the train-test split you will use the initial_split() function from the rsample package.
The prop parameter specifies the proportion of data that will be selected for the training set, in this case 75%. This means that the remaining 25% of the data will be randomly withheld as the test set.
To prepare the training and testing data frames, you use the functions training() and testing(), respectively.
Of the 4004 observations in the gapminder dataset, 3003, or approximately 75%, are partitioned into the training data, and the remaining 25% are reserved as the testing data.
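A minimal sketch of that split, assuming the gapminder data frame is already loaded (the seed is arbitrary, included only so the random split is reproducible):

library(rsample)

set.seed(42)                                 # arbitrary seed for reproducibility
gap_split <- initial_split(gapminder, prop = 0.75)

training_data <- training(gap_split)         # 3003 rows, approximately 75%
testing_data  <- testing(gap_split)          # 1001 rows, the remaining 25%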
6. Train-Validate Split
Because you are interested in keeping the test data independent, you must not use it to make any decisions about your models.
So, to answer the second question,
"Did I select the best performing model?"
you must rely exclusively on the train data.
7. Train-Validate Split
The train data can be further split into two partitions: train and validate. Now you can use the new train data to build your models and the validate data to calculate their performance.
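One way to sketch a single train-validate split is to reuse initial_split() on the training data; the 75% proportion here is an assumption for illustration, not a requirement:

validate_split <- initial_split(training_data, prop = 0.75)
train_data     <- training(validate_split)   # used to build candidate models
validate_data  <- testing(validate_split)    # used to measure their performance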
8. Cross Validation
You can take this one step further by repeating this train-validate split several times, each time reserving a different portion of the data for evaluation.
This is known as cross validation and it provides two key advantages:
First, by iteratively withholding different portions of the training data you can essentially use all of it to evaluate the overall performance of a model.
Second, you obtain multiple measurements of performance, which helps account for the natural variability in any single measurement of a model's performance.
9. vfold_cv()
You can use the function vfold_cv() from the rsample package to build these cross-validated pairs of train and validate data. The parameter v indicates how many folds the data should be split into.
This new data frame brings you back to the list column workflow. In order to build a model for each fold, you will first need to extract the train and validate data frames into their own list columns.
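A minimal sketch, continuing from the training_data above; v = 3 is an arbitrary choice to match the 3 folds used in this chapter:

set.seed(42)
cv_split <- vfold_cv(training_data, v = 3)
cv_split
# A tibble with one row per fold: a splits list column holding each
# train/validate pair, plus an id column (Fold1, Fold2, Fold3)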
10. Mapping train & validate
To do this, you will use map() to apply the training() and testing() functions to each split. This creates the desired train and validate data frames for each fold, as sketched below.
Notice that this is similar to what you did with the initial split, except now you are doing it for many splits.
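A sketch of that step, assuming dplyr and purrr are loaded alongside rsample (rsample also documents analysis() and assessment() as extractors for resampling splits; they return the same data frames here):

library(dplyr)
library(purrr)

cv_data <- cv_split %>%
  mutate(
    train    = map(splits, ~ training(.x)),  # one train data frame per fold
    validate = map(splits, ~ testing(.x))    # one validate data frame per fold
  )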
11. Cross Validated Models
And you're back to building many models!
Just like in the last chapter, you can use each of the 3 train data frames to build corresponding models.
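A minimal sketch of that step; the formula is a hypothetical stand-in (substitute the outcome and predictors your models actually use):

# life_expectancy ~ . is an assumed formula for illustration
cv_models <- cv_data %>%
  mutate(model = map(train, ~ lm(life_expectancy ~ ., data = .x)))
# Each row now holds a model fit on that fold's train data, ready to be
# evaluated against the matching validate data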
12. Let's practice!
Now, let's progress to the exercises and apply what you've learned.