Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you wanted to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

This exercise is part of the course

Preprocessing for Machine Learning in Python

View Course

Exercise instructions

Create a DataFrame of features, X, with all of the columns except category_desc.
Create a DataFrame of labels, y from the category_desc column.
Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets
Print the labels and counts in y_train using .value_counts().

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create a DataFrame with all columns except category_desc
X = volunteer.____(____, axis=____)

# Create a category_desc labels dataset
y = ____[[____]]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = ____(____, ____, ____, random_state=42)

# Print the category_desc counts from y_train
print(____[____].____)

Edit and Run Code