Exercise

Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you wanted to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

Instructions

100 XP
  • Create a DataFrame of features, X, with all of the columns except category_desc.
  • Create a DataFrame of labels, y from the category_desc column.
  • Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets
  • Print the labels and counts in y_train using .value_counts().