Splitting the data set

To make your training and test sets, you should first set a seed using set.seed(). A seed fixes the starting point of the random number generator, so that the same sequence of random numbers, and therefore the same result, is produced each time your code is run. The advantage of doing this before sampling is that you or anyone else can recreate the exact same training and test sets by using the same seed.
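As a small illustration of that reproducibility, the sketch below draws from a toy vector rather than from the course data. The object names first_draw, second_draw and third_draw are made up for this example.

```r
# Same seed, same draws: re-seeding before each call reproduces the sample.
set.seed(567)
first_draw <- sample(1:10, 3)

set.seed(567)
second_draw <- sample(1:10, 3)

identical(first_draw, second_draw)  # TRUE

# Without re-seeding, the generator continues from its current state,
# so this draw is not tied to the first one and will usually differ.
third_draw <- sample(1:10, 3)
```

This is why the seed is set once, immediately before the sampling step that needs to be repeatable.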

Using sample(), you can randomly assign observations to the training and test sets.

For this exercise you will use the first two arguments of the sample() function:

The first argument is the vector from which we will sample values. We will randomly pick row numbers as indices; you can use 1:nrow(loan_data) to create the vector of row numbers.
The second argument is the number of items to choose. We will enter 2 / 3 * nrow(loan_data), as we construct the training set first.
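The two arguments can be tried out on a small stand-in before touching the real data. In the sketch below, toy_data, n_train and index_toy are hypothetical names, and the data frame is invented for illustration; it is not the course's loan_data.

```r
# Hypothetical stand-in for loan_data: 9 rows, two made-up columns.
toy_data <- data.frame(id = 1:9, amount = seq(1000, 9000, by = 1000))

set.seed(567)

# First argument: the vector of row numbers to draw from.
# Second argument: how many of them to draw (two thirds of the rows).
n_train <- 2 / 3 * nrow(toy_data)
index_toy <- sample(1:nrow(toy_data), n_train)

length(index_toy)         # number of row numbers drawn for training
anyDuplicated(index_toy)  # 0: sample() draws without replacement by default
```

Because sample() draws without replacement unless told otherwise, no row number appears twice in the result. When the row count is not a multiple of three, the second argument is not a whole number; wrapping it in floor() or round() would make the intended count explicit, although that is a suggestion here and not part of the exercise.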

This is a part of the course “Credit Risk Modeling in R”.

Exercise instructions

Set a seed of 567 using the set.seed() function.
Store the row indices of the training set in the object index_train. Use the sample() function with the first and second arguments described above.
Create the training set by selecting the row numbers stored in index_train from the data set loan_data. Save the result to training_set.
The test set contains the rows that are not in index_train. Copy the code that you used to create the training set, but use the negative sign (-) right before index_train inside the square brackets. Save the result to test_set.
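To see how positive and negative row indices behave inside the square brackets, here is a minimal sketch on invented data. The names toy_data, index_toy, train_toy and test_toy are hypothetical, and the index vector is written out by hand instead of being sampled, so the result is easy to check by eye.

```r
# Hypothetical stand-in data; the real loan_data comes from the course.
toy_data <- data.frame(id = 1:6, status = c(0, 1, 0, 0, 1, 0))

# Suppose these row numbers were drawn for the training set.
index_toy <- c(2, 5, 1, 6)

train_toy <- toy_data[index_toy, ]   # rows 2, 5, 1, 6, in that order
test_toy  <- toy_data[-index_toy, ]  # every row except those: rows 3 and 4

test_toy$id                                         # 3 4
nrow(train_toy) + nrow(test_toy) == nrow(toy_data)  # TRUE
```

The comma after the index matters: it tells R to select rows and keep all columns. The negative sign drops the listed rows, so the two subsets never overlap and together cover every row.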

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Set seed of 567


# Store row numbers for training set: index_train


# Create training set: training_set
training_set <- loan_data[___, ]

# Create test set: test_set
