Creating random test datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.
As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.
The sample()
function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.
Use the resulting vector of row IDs to subset the loans into training and testing datasets. The dataset loans
is available for you to use.
This is a part of the course
“Supervised Learning in R: Classification”
Exercise instructions
- Apply the
nrow()
function to determine how many observations are in theloans
dataset, and the number needed for a 75% sample. - Use the
sample()
function to create an integer vector of row IDs for the 75% sample. The first argument ofsample()
should be the number of rows in the data set, and the second is the number of rows you need in your training set. - Subset the
loans
data using the row IDs to create the training dataset. Save this asloans_train
. - Subset
loans
again, but this time select all the rows that are not insample_rows
. Save this asloans_test
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the number of rows for training
# Create a random sample of row IDs
sample_rows <- sample(___, ___)
# Create the training dataset
loans_train <- loans[___]
# Create the test dataset
loans_test <- loans[___]