Stratified sampling
You now know that the distribution of class labels in the category_desc
column of the volunteer
dataset is uneven. If you wanted to train a model to predict category_desc
, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!
This exercise is part of the course
Preprocessing for Machine Learning in Python
Exercise instructions
- Create a DataFrame of features,
X
, with all of the columns exceptcategory_desc
. - Create a DataFrame of labels,
y
from thecategory_desc
column. - Split
X
andy
into training and test sets, ensuring that the class distribution in the labels is the same in both sets - Print the labels and counts in
y_train
using.value_counts()
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a DataFrame with all columns except category_desc
X = volunteer.____(____, axis=____)
# Create a category_desc labels dataset
y = ____[[____]]
# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = ____(____, ____, ____, random_state=42)
# Print the category_desc counts from y_train
print(____[____].____)