Creating random test datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.
You can use 75% of the observations for training and 25% for testing the model.
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training. Use the resulting vector of row IDs to subset the loans into training and testing datasets. The dataset loans is available for you to use.
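For instance, here is a quick illustrative sketch of sample() at the console (the numbers are arbitrary):

# Draw 3 row IDs at random from 1:10, without replacement
set.seed(123)  # optional: makes the random draw reproducible
sample(10, 3)

Because the draw is random, each run produces a different set of row IDs unless you set a seed first.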
Exercise instructions
- Apply the nrow() function to determine how many observations are in the loans dataset, and the number needed for a 75% sample.
- Use the sample() function to create an integer vector of row IDs for the 75% sample. The first argument of sample() should be the number of rows in the dataset, and the second is the number of rows you need in your training set.
- Subset the loans data using the row IDs to create the training dataset. Save this as loans_train.
- Subset loans again, but this time select all the rows that are not in sample_rows. Save this as loans_test.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the number of rows for training
nrow(___)

# Create a random sample of row IDs
sample_rows <- sample(___, ___)

# Create the training dataset
loans_train <- loans[___, ]

# Create the test dataset
loans_test <- loans[___, ]
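For reference, here is one possible completion of the scaffold (a sketch, assuming loans is a data frame already loaded in your session; n_train is a helper name introduced here for clarity):

# Determine the number of rows for training (75% of the data)
n_train <- floor(0.75 * nrow(loans))

# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), n_train)

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset: every row NOT in sample_rows
loans_test <- loans[-sample_rows, ]

The negative index -sample_rows selects all rows except those chosen for training, so the two subsets never overlap.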
This exercise is part of the course
Supervised Learning in R: Classification
In this course you will learn the basics of machine learning for classification.
Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We'll use the Lending Club dataset to simulate this scenario.
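As a preview of where the course goes, here is a minimal sketch of fitting a classification tree with the rpart package (the column names outcome, loan_amount, and credit_score are hypothetical placeholders, not the actual Lending Club fields):

library(rpart)

# Fit a classification tree predicting loan outcome from applicant features
loan_model <- rpart(outcome ~ loan_amount + credit_score,
                    data = loans_train, method = "class")

# Predict the class of each applicant in the held-out test set
predict(loan_model, loans_test, type = "class")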