Creating train, test, and validation datasets

1. Creating train, test, and validation datasets

Hello everyone - Let's get started with creating training, testing, and validation datasets.

2. Traditional train/test split

In the first few lessons, we called data "seen" if it was used for model fitting, and "unseen" if the model was not trained on it. In model validation, we use holdout samples to replicate this idea. We define a holdout dataset as any data that is not used for training and is only used to assess model performance. The available data is split into two datasets: one used for training, and one that is off limits while we train our models, called the test (or holdout) dataset. This step is vital to model validation and is the single most important step you can take to get an honest estimate of your model's performance.

3. Dataset definitions and ratios

We use the holdout sample as a testing dataset so that we can have an unbiased estimate for our model's performance after we are completely done training. Generally, a good rule of thumb is using an 80:20 split. This equates to setting aside twenty percent of the data for the test set and using the rest for training. You might choose to use more training data when the overall data is limited, or less training data if the modeling method is computationally expensive.

4. The X and y datasets

Before we use scikit-learn's holdout-creation function, train_test_split(), we will take the tic_tac_toe dataset and create an X dataset with the predictive data and a y dataset with just the responses. The first nine columns of tic_tac_toe can be used for training, while the 10th column contains the response values. As a quick aside, classification models with categorical predictors, such as those found in the tic_tac_toe dataset, require dummy variables. If you are unfamiliar with dummy variables, check out DataCamp's other courses, which cover them in more detail.
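The X/y split and dummy-variable step above can be sketched as follows. This is a minimal example using a toy DataFrame that mimics tic_tac_toe's layout; the column names and values here are illustrative, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for tic_tac_toe: categorical board squares
# plus a "Class" response column (illustrative, not the real data).
df = pd.DataFrame({
    "Top-Left": ["x", "o", "b"],
    "Top-Middle": ["o", "x", "x"],
    "Class": ["positive", "negative", "positive"],
})

# Predictors (X) need dummy variables; the response (y) stays as-is.
X = pd.get_dummies(df.drop("Class", axis=1))
y = df["Class"]

print(list(X.columns))
# One indicator column per category of each predictor, e.g. "Top-Left_x"
```

pd.get_dummies() expands each categorical column into one indicator column per category, which is the form most scikit-learn classifiers expect.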

5. Creating holdout samples

The train_test_split() function is straightforward. We split both the X and y datasets into a train and a test dataset. This function has a few parameters that we will use. test_size takes either a float (a proportion of the data) or an integer (a row count) and specifies how big the test set should be. If test_size is omitted, you can instead use train_size to set the size of the training set. And finally, random_state sets the random seed, which keeps the split reproducible.
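Putting those parameters together, a holdout split looks like this. The arrays here are placeholder data standing in for the tic_tac_toe X and y; the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (50 rows, 2 columns) and labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% for testing; random_state fixes the seed so the
# same rows land in the test set every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

Note the return order: the function returns the train and test pieces of each array you pass in, in the order the arrays were given.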

6. Dataset for preliminary testing?

We know that the test set is off limits until we are completely done training, but what do we do when testing model parameters? For example, if we run a random forest model with 100 trees and one with 1000 trees, which dataset do we use to test these results?

7. Holdout samples for parameter tuning

When testing parameters, tuning hyperparameters, or any time we are repeatedly evaluating model performance, we need to create a second holdout sample, called the validation dataset. For this split, the available data is the original training dataset, which is then split in the same manner used to split the complete dataset. We use the validation sample to assess our model's performance when trying different parameter values.

8. Train, validation, test continued

To create both holdout samples, the testing and validation datasets, we use scikit-learn's train_test_split() function twice. The first call creates training and testing datasets as usual. In the second call, we split this temporary training dataset into the final training and validation datasets. In this example, we first used an 80/20 split to create the test set. We then applied a 75/25 split to the 80% training dataset to create a validation dataset, leaving us with 60% of the data for training, 20% for validation, and 20% for testing.
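The two-step split above can be sketched like this, again with placeholder arrays standing in for the real X and y and an arbitrary seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows of features and labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: 80% temporary training data, 20% test.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

# Second split: 75/25 on the temporary set,
# i.e. 0.8 * 0.75 = 60% train and 0.8 * 0.25 = 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second test_size is 0.25, not 0.2, because it is a fraction of the remaining 80%, which is what yields the 60/20/20 split.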

9. It's holdout time

Let's practice making holdout sets to use in our models.