Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you wanted to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

Deze oefening maakt deel uit van de cursus

Preprocessing for Machine Learning in Python

Cursus bekijken

Oefeninstructies

Create a DataFrame of features, X, with all of the columns except category_desc.
Create a DataFrame of labels, y from the category_desc column.
Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets
Print the labels and counts in y_train using .value_counts().

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Create a DataFrame with all columns except category_desc
X = volunteer.____(____, axis=____)

# Create a category_desc labels dataset
y = ____[[____]]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = ____(____, ____, ____, random_state=42)

# Print the category_desc counts from y_train
print(____[____].____)

Code bewerken en uitvoeren