Avoiding class imbalances

Some datasets have very imbalanced outcomes, such as a rare-disease dataset. A purely random split can then turn out badly: imagine all the rare observations landing in the test set and none in the training set. The model would never see the rare class, which would ruin your whole training process!

Fortunately, the initial_split() function provides a remedy. In this exercise, you will first observe such a class imbalance and then solve it.

Code is already provided that creates a split object, diabetes_split, with 75% of the data for training and 25% for testing.

This exercise is part of the course

Machine Learning with Tree-Based Models in R

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Preparation
# initial_split(), training() and testing() come from the rsample package
library(rsample)
set.seed(9888)
diabetes_split <- initial_split(diabetes, prop = 0.75)

# Proportion of 'yes' outcomes in the training data
counts_train <- table(training(___)$outcome)
prop_yes_train <- counts_train["___"] / sum(counts_train)

# Proportion of 'yes' outcomes in the test data
counts_test <- table(___)
prop_yes_test <- ___ / sum(___)

paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))
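
The remedy mentioned above can be sketched as follows. This is a minimal illustration, not the exercise solution: it assumes the rsample package is installed and uses the strata argument of initial_split() to stratify on the outcome. The data frame toy and the helper prop_yes() are hypothetical stand-ins for the course's diabetes data, which is not reproduced here.

```r
library(rsample)

set.seed(9888)

# Hypothetical toy data (not the course's diabetes data): 15% "yes" outcomes
toy <- data.frame(
  x = rnorm(400),
  outcome = factor(c(rep("yes", 60), rep("no", 340)))
)

# Hypothetical helper: share of "yes" outcomes in a data frame
prop_yes <- function(df) {
  counts <- table(df$outcome)
  as.numeric(counts["yes"]) / sum(counts)
}

# Purely random split: class proportions may differ between the two sets
random_split <- initial_split(toy, prop = 0.75)

# Stratified split: sampling is done within each outcome class,
# so both sets should keep roughly the overall class proportions
strat_split <- initial_split(toy, prop = 0.75, strata = outcome)

random_props <- c(train = prop_yes(training(random_split)),
                  test  = prop_yes(testing(random_split)))
strat_props  <- c(train = prop_yes(training(strat_split)),
                  test  = prop_yes(testing(strat_split)))

print(round(random_props, 2))
print(round(strat_props, 2))
```

Comparing the two printed vectors shows whether stratification brought the training and test proportions closer together. With a random split the gap depends on the seed, so it may be small or large on any given run.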