Undersampling the training set

In the video, you saw that to overcome the unbalanced data problem, you can use under- or oversampling. The training set has been undersampled for you, such that 1/3 of the training set consists of defaults, and 2/3 of non-defaults. The resulting data set is available in your workspace and named undersampled_training_set, and contains less observations (6570 instead of 19394). In this exercise, you will create a decision tree using the undersampled data set.

You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore. Don't worry about this for now, we will tell you how you can make them more manageable in the next video!

This exercise is part of the course

Credit Risk Modeling in R

View Course

Exercise instructions

  • The rpart package has been installed for you. Load the package in your workspace.
  • Change the code provided such that a decision tree is constructed using the undersampled training set instead of training_set. Additionally, add the argument control = rpart.control(cp = 0.001). cp, which is the complexity parameter, is the threshold value for a decrease in overall lack of fit for any split. If cp is not met, further splits will no longer be pursued. cp's default value is 0.01, but for complex problems, it is advised to relax cp.
  • Plot the decision tree using the function plot and the tree object name. Add a second argument uniform = TRUE to get equal-sized branches.
  • The previous command simply creates a tree with some nodes and edges, but without any text (or so-called "labels") on it. Use function text() with sole argument tree_undersample to add labels.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load package rpart in your workspace.


# Change the code provided in the video such that a decision tree is constructed using the undersampled training set. Include rpart.control to relax the complexity parameter to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  training_set)

# Plot the decision tree


# Add labels to the decision tree