Avoiding class imbalances

Some datasets have highly imbalanced outcomes, as in a rare disease dataset. With a purely random split, you might end up with a very unlucky partition. Imagine that all the rare observations land in the test set and none in the training set: the model would never see a positive case during training, which would ruin the whole training process!

Fortunately, the initial_split() function provides a remedy through its strata argument. In this exercise, you will first observe such a class imbalance and then fix it.

Code is already provided that creates a split object, diabetes_split, with 75% of the data for training and 25% for testing.

1
- Count the proportion of "yes" outcomes in the training and test sets of diabetes_split.
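
As a rough sketch of this first step, the proportions can be computed with training() and testing() from rsample. The exercise data is not shown here, so the data frame below is a hypothetical stand-in, and the column names bmi and outcome are assumptions rather than the course's actual names.

```r
library(rsample)  # initial_split(), training(), testing()

set.seed(42)

# Hypothetical stand-in for the exercise data: 80 "yes" and 320 "no" rows.
diabetes <- data.frame(
  bmi = rnorm(400, mean = 30, sd = 5),
  outcome = factor(rep(c("no", "yes"), times = c(320, 80)))
)

# Plain random 75/25 split, with no stratification.
diabetes_split <- initial_split(diabetes, prop = 0.75)

# Proportion of "yes" outcomes in each partition.
train_yes <- mean(training(diabetes_split)$outcome == "yes")
test_yes <- mean(testing(diabetes_split)$outcome == "yes")

print(c(train = train_yes, test = test_yes))
```

Because the split is random, the two proportions will generally differ, and the gap tends to grow as the "yes" class gets rarer.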

2
- Recreate diabetes_split with the same training/testing proportion, but with the outcome variable distributed equally across both sets.
- Count the proportion of "yes" outcomes in both sets again.
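
One way to approach this second step is to pass the outcome column to the strata argument of initial_split(). The sketch below again uses a hypothetical stand-in data frame; the column name outcome is an assumption and may differ in the actual exercise data.

```r
library(rsample)  # initial_split(), training(), testing()

set.seed(42)

# Hypothetical stand-in for the exercise data: 80 "yes" and 320 "no" rows.
diabetes <- data.frame(
  bmi = rnorm(400, mean = 30, sd = 5),
  outcome = factor(rep(c("no", "yes"), times = c(320, 80)))
)

# Same 75/25 proportion, but sampled within each level of the outcome.
diabetes_split <- initial_split(diabetes, prop = 0.75, strata = outcome)

# Proportion of "yes" outcomes in each partition.
train_yes <- mean(training(diabetes_split)$outcome == "yes")
test_yes <- mean(testing(diabetes_split)$outcome == "yes")

print(c(train = train_yes, test = test_yes))
```

With stratification, the 75/25 sampling happens separately inside the "yes" and "no" groups, so the two printed proportions should come out close to each other and to the overall share of "yes" in the data.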
