Avoiding class imbalances
Some datasets contain very imbalanced outcomes, like a rare-disease dataset. When splitting randomly, you might end up with a very unfortunate split: imagine all the rare observations landing in the test set and none in the training set. That would ruin your whole training process!
Fortunately, the initial_split() function provides a remedy through its strata argument. In this exercise, you are going to observe and then solve these so-called class imbalances.
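For reference, here is a minimal sketch of a stratified split (diabetes_split_strat is just an illustrative name, and the outcome column is assumed to be called outcome, as in the sample code below):
# Stratified split: sample within each outcome class separately, so that
# 'yes' and 'no' appear in roughly the same proportions in both sets
diabetes_split_strat <- initial_split(diabetes, prop = 0.75, strata = outcome)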
Code is already provided to create a split object diabetes_split with a 75% training and 25% test split, without stratification.
Have a go at this exercise by working through the sample code below.
# Preparation
library(rsample)   # provides initial_split(), training() and testing()
set.seed(9888)
diabetes_split <- initial_split(diabetes, prop = 0.75)

# Proportion of 'yes' outcomes in the training data
counts_train <- table(training(diabetes_split)$outcome)
prop_yes_train <- counts_train["yes"] / sum(counts_train)

# Proportion of 'yes' outcomes in the test data
counts_test <- table(testing(diabetes_split)$outcome)
prop_yes_test <- counts_test["yes"] / sum(counts_test)

paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))