Splitting the data set

To make your training and test sets, first set a seed using set.seed(). A seed fixes the starting point of the random number generator, so that the same random numbers, and therefore the same results, are produced each time your code is run. The advantage for sampling is that you or anyone else can recreate exactly the same training and test sets by using the same seed.
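As a small illustration of why the seed matters, the sketch below draws the same random sample twice. The object names first_draw and second_draw are made up for this example and are not part of the exercise.

```r
# Setting the same seed before each draw reproduces the same random numbers.
set.seed(567)
first_draw <- sample(1:10, 3)

set.seed(567)
second_draw <- sample(1:10, 3)

# Both draws started from the same seed, so they match exactly.
identical(first_draw, second_draw)  # TRUE
```

Without the second set.seed() call, the second draw would continue from wherever the random number generator left off and would usually differ from the first.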

Using sample(), you can randomly assign observations to the training and test sets.

For this exercise you will use the first two arguments of the sample() function:

  • The first argument is the vector from which values are sampled. You will randomly pick row numbers as indices; you can use 1:nrow(loan_data) to create the vector of row numbers.
  • The second argument is the number of items to choose. You will enter 2 / 3 * nrow(loan_data), because the training set is constructed first and takes two thirds of the observations.
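The two arguments described above can be tried out on a small stand-in data frame. In this sketch, toy_data and toy_index are hypothetical names chosen for illustration; the exercise itself uses loan_data and index_train.

```r
# toy_data is a made-up stand-in for loan_data, used only for illustration.
toy_data <- data.frame(id = 1:9, amount = seq(1000, 9000, by = 1000))

set.seed(567)
# First argument: the row numbers to sample from.
# Second argument: how many row numbers to draw (two thirds of the rows).
toy_index <- sample(1:nrow(toy_data), 2 / 3 * nrow(toy_data))

length(toy_index)         # how many row numbers were drawn
anyDuplicated(toy_index)  # 0, because sample() draws without replacement by default
```

When the number of rows is not a multiple of three, the second argument is not a whole number; wrapping it in round() is one way to make the intended count explicit.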

This is a part of the course

“Credit Risk Modeling in R”

Exercise instructions

  • Set a seed of 567 using the set.seed() function.
  • Store the row indices of the training set in the object index_train. Use the sample() function with a first and a second argument as discussed above.
  • Create the training set by selecting the row numbers stored in index_train from the data set loan_data. Save the result to training_set.
  • The test set contains the rows that are not in index_train. Copy the code that you used to create the training set, but use the negative sign (-) right before index_train inside the square brackets. Save the result to test_set.
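The negative-index step can be seen on a tiny example. This is only a sketch on invented data: toy_data, toy_index, toy_train, and toy_test are hypothetical names, and the row numbers are hand-picked rather than sampled so that the result is easy to check by eye.

```r
# A made-up six-row data frame, used only to illustrate row selection.
toy_data <- data.frame(id = 1:6, amount = c(500, 1200, 800, 3000, 150, 2200))

# Hand-picked row numbers, standing in for the result of sample().
toy_index <- c(2, 4, 5, 6)

toy_train <- toy_data[toy_index, ]   # keeps rows 2, 4, 5 and 6
toy_test  <- toy_data[-toy_index, ]  # drops those rows, leaving rows 1 and 3

toy_test$id                                         # 1 3
nrow(toy_train) + nrow(toy_test) == nrow(toy_data)  # TRUE
```

Because the negative sign removes exactly the rows listed in the index, the two sets never overlap and together cover every row of the original data.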

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Set seed of 567


# Store row numbers for training set: index_train


# Create training set: training_set
training_set <- loan_data[___, ]

# Create test set: test_set

This exercise is part of the course

Credit Risk Modeling in R

Skill level: Intermediate

Apply statistical modeling in a real-life setting using logistic regression and decision trees to model credit risk.

This chapter begins with a general introduction to credit risk models. We'll explore a real-life data set, then preprocess the data set such that it's in the appropriate format before applying the credit risk models.

  • Exercise 1: Introduction and data structure
  • Exercise 2: Exploring the credit data
  • Exercise 3: Interpreting a CrossTable()
  • Exercise 4: Histograms and outliers
  • Exercise 5: Histograms
  • Exercise 6: Outliers
  • Exercise 7: Missing data and coarse classification
  • Exercise 8: Deleting missing data
  • Exercise 9: Replacing missing data
  • Exercise 10: Keeping missing data
  • Exercise 11: Data splitting and confusion matrices
  • Exercise 12: Splitting the data set
  • Exercise 13: Creating a confusion matrix
