Splitting the data set
To make your training and test sets, you should first set a seed using set.seed(). A seed fixes the starting point of the random number generator, so the same sequence of random numbers, and therefore the same result, is produced each time your code is run. The advantage of doing this in your sampling is that you or anyone else can recreate the exact same training and test sets by using the same seed.
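As a small illustration of why the seed matters, the sketch below draws the same five numbers twice by resetting the seed before each draw. The vector 1:100 and the object names first_draw, second_draw and third_draw are made up for this example and are not part of the exercise.

```r
# Same seed, same draw: the two samples below are identical.
set.seed(567)
first_draw <- sample(1:100, 5)

set.seed(567)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the next draw continues the random stream,
# so it will generally differ from the first one.
third_draw <- sample(1:100, 5)
```

This is why the seed should be set immediately before the sampling step whose result you want to reproduce.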
Using sample(), you can randomly assign observations to the training and test set.
For this exercise you will use the first two arguments of the sample() function:
- The first argument is the vector from which we will sample values. We will randomly pick row numbers as indices; you can use 1:nrow(loan_data) to create the vector of row numbers.
- The second argument is the number of items to choose. We will enter 2 / 3 * nrow(loan_data), as we construct the training set first.
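The two arguments can be tried out on a small stand-in data frame before working with the real data. In this sketch, toy_loans and its columns loan_amnt and loan_status are hypothetical placeholders, not the actual structure of loan_data.

```r
# Hypothetical stand-in for loan_data; the real data set is not reproduced here.
toy_loans <- data.frame(
  loan_amnt   = seq(1000, by = 100, length.out = 300),
  loan_status = rep(c(0, 0, 1), times = 100)
)

set.seed(567)

# First argument: the row numbers to draw from.
row_numbers <- 1:nrow(toy_loans)

# Second argument: how many row numbers to draw (two thirds of the rows).
index_train <- sample(row_numbers, 2 / 3 * nrow(toy_loans))

length(index_train)         # roughly two thirds of the 300 rows
anyDuplicated(index_train)  # 0: sample() draws without replacement by default
```

Because sample() does not replace drawn values by default, no row number appears twice, which is what you want when the indices are later used to pick distinct rows.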
This is part of the course “Credit Risk Modeling in R”.
Exercise instructions
- Set a seed of 567 using the set.seed() function.
- Store the row indices of the training set in the object index_train. Use the sample() function with a first and a second argument as discussed above.
- Create the training set by selecting the row numbers stored in index_train from the data set loan_data. Save the result to training_set.
- The test set contains the rows that are not in index_train. Copy the code that you used to create the training set, but use the negative sign (-) right before index_train inside the square brackets. Save the result to test_set.
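The positive and negative indexing pattern described in these steps can be checked on a tiny example first. The sketch below assumes a made-up data frame toy_loans with columns id and amount; the names toy_training and toy_test are likewise illustrative, and the same pattern would then be applied to loan_data.

```r
# Hypothetical stand-in for loan_data, used only to show the indexing pattern.
toy_loans <- data.frame(id = 1:12, amount = seq(500, by = 250, length.out = 12))

set.seed(567)
index_train <- sample(1:nrow(toy_loans), 2 / 3 * nrow(toy_loans))

# Positive indices keep the selected rows.
toy_training <- toy_loans[index_train, ]
# Negative indices drop them, leaving every row that was not selected.
toy_test <- toy_loans[-index_train, ]

# The two sets are disjoint and together cover every row.
nrow(toy_training) + nrow(toy_test) == nrow(toy_loans)  # TRUE
length(intersect(toy_training$id, toy_test$id)) == 0    # TRUE
```

The comma inside the square brackets matters: the index goes before it to select rows, and leaving the part after it empty keeps all columns.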
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Set seed of 567
# Store row numbers for training set: index_train
# Create training set: training_set
training_set <- loan_data[___, ]
# Create test set: test_set
The course applies statistical modeling in a real-life setting, using logistic regression and decision trees to model credit risk.
This chapter begins with a general introduction to credit risk models. We'll explore a real-life data set, then preprocess the data set such that it's in the appropriate format before applying the credit risk models.