Avoiding class imbalances

Some datasets have highly imbalanced outcomes, such as a rare disease dataset. With a purely random split, you might end up with a very unfortunate partition. Imagine that all the rare observations land in the test set and none in the training set: the model would never see a positive case, and that would ruin your whole training process!
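To get a feel for how likely such an unlucky split is, here is a small worked calculation in base R. The numbers are made up for illustration (200 observations, 6 of them rare, 25% held out for testing) and are not taken from the diabetes data. Under simple random sampling, the number of rare cases that fall into the test set follows a hypergeometric distribution, so dhyper() gives the probabilities directly.

```r
# Hypothetical numbers, chosen only for illustration
n_total <- 200   # all observations
n_rare  <- 6     # rare 'yes' outcomes
n_test  <- 50    # 25% of the data held out for testing

# Probability that the test set contains no rare case at all
p_none_in_test <- dhyper(0, m = n_rare, n = n_total - n_rare, k = n_test)

# Probability that every rare case lands in the test set
# (so the training set contains none)
p_all_in_test <- dhyper(n_rare, m = n_rare, n = n_total - n_rare, k = n_test)

round(c(none_in_test = p_none_in_test, all_in_test = p_all_in_test), 4)
```

With so few rare cases, a test set with no positives at all is not an exotic event, while losing every positive from the training set is much less likely but still possible.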

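Fortunately, the initial_split() function provides a remedy: its strata argument samples within each outcome class, so both sets keep similar class proportions. In this exercise, you are going to observe and solve these so-called class imbalances.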

There is already code provided to create a split object diabetes_split with a 75% training and 25% test split.
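As a rough sketch of what stratification does, the example below splits a toy data frame with and without the strata argument of initial_split() from the rsample package (loaded as part of tidymodels). The toy data frame, its columns, and the prop_yes() helper are hypothetical and stand in for the course's diabetes data, which is not reproduced here.

```r
library(rsample)

set.seed(42)

# Hypothetical toy data: 80 'yes' and 320 'no' outcomes
toy <- data.frame(
  x = rnorm(400),
  outcome = factor(rep(c("yes", "no"), times = c(80, 320)))
)

# Helper (not part of rsample): share of 'yes' outcomes in a data frame
prop_yes <- function(d) mean(d$outcome == "yes")

# Plain random split: class proportions can drift between the two sets
random_split <- initial_split(toy, prop = 0.75)

# Stratified split: sampling happens within each outcome class
strat_split <- initial_split(toy, prop = 0.75, strata = outcome)

c(train = prop_yes(training(random_split)), test = prop_yes(testing(random_split)))
c(train = prop_yes(training(strat_split)),  test = prop_yes(testing(strat_split)))
```

In the stratified split, the share of 'yes' outcomes should be close to the overall share in both the training and the test set, whereas the random split offers no such guarantee.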

This exercise is part of the course

Machine Learning with Tree-Based Models in R

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Preparation
set.seed(9888)
diabetes_split <- initial_split(diabetes, prop = 0.75)

# Proportion of 'yes' outcomes in the training data
counts_train <- table(training(___)$outcome)
prop_yes_train <- counts_train["___"] / sum(counts_train)

# Proportion of 'yes' outcomes in the test data
counts_test <- table(___)
prop_yes_test <- ___ / sum(___)

paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))