Session Ready
Exercise

Stratified sampling

We know that the distribution of variables in the category_desc column in the volunteer dataset is uneven. If we wanted to train a model to try to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

Instructions
100 XP
  • Create a volunteer_X dataset with all of the columns except category_desc.
  • Create a volunteer_y training labels dataset.
  • Split up the volunteer_X dataset using scikit-learn's train_test_split function and passing volunteer_y into the stratify= parameter.
  • Take a look at the category_desc value counts on the training labels.