Undersampling the training set
In the video, you saw that you can use under- or oversampling to overcome the problem of unbalanced data. The training set has been undersampled for you, such that 1/3 of the training set consists of defaults and 2/3 of non-defaults. The resulting data set is available in your workspace as undersampled_training_set, and contains fewer observations (6570 instead of 19394). In this exercise, you will create a decision tree using this undersampled data set.
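The undersampled set is already available in your workspace, but for reference, here is a minimal sketch of how such a set could be built. It assumes the loan_status column uses 1 for defaults and 0 for non-defaults, as in the course data; the seed is arbitrary.

# Arbitrary seed, for reproducibility only
set.seed(567)
# Split the original training set into defaults and non-defaults
defaults     <- training_set[training_set$loan_status == 1, ]
non_defaults <- training_set[training_set$loan_status == 0, ]
# Keep all defaults and draw twice as many non-defaults,
# giving the 1/3 defaults vs. 2/3 non-defaults split
kept_non_defaults <- non_defaults[sample(nrow(non_defaults), 2 * nrow(defaults)), ]
undersampled_training_set <- rbind(defaults, kept_non_defaults)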
You will notice that the trees in this and the next exercises are very big, so big that you cannot really read them anymore. Don't worry about this for now; we will tell you how you can make them more manageable in the next video!
Exercise instructions
- The rpart package has been installed for you. Load the package in your workspace.
- Change the code provided such that a decision tree is constructed using undersampled_training_set instead of training_set. Additionally, add the argument control = rpart.control(cp = 0.001). The complexity parameter cp is the threshold for the minimum decrease in overall lack of fit that any split must achieve; if a split does not meet cp, it is not pursued further. The default value of cp is 0.01, but for complex problems it is advised to relax cp (see the short sketch after this list).
- Plot the decision tree using the function plot() with the tree object name as the first argument. Add a second argument uniform = TRUE to get equal-sized branches.
- The previous command simply creates a tree with some nodes and edges, but without any text (so-called "labels") on it. Use the function text() with the sole argument tree_undersample to add labels.
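Once the tree is fitted, you can see what relaxing cp buys you by inspecting the complexity table that rpart stores with the model. A quick sketch, assuming tree_undersample has been built as described above:

# Show the cross-validated error for each candidate cp value
printcp(tree_undersample)
# Plot the cross-validated error against tree size / cp
plotcp(tree_undersample)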
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load package rpart in your workspace
library(rpart)

# Construct a decision tree using the undersampled training set,
# relaxing the complexity parameter to 0.001
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data = undersampled_training_set,
                          control = rpart.control(cp = 0.001))

# Plot the decision tree with equal-sized branches
plot(tree_undersample, uniform = TRUE)

# Add labels to the decision tree
text(tree_undersample)