LoslegenKostenlos loslegen

Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you wanted to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

Diese Übung ist Teil des Kurses

Preprocessing for Machine Learning in Python

Kurs anzeigen

Anleitung zur Übung

  • Create a DataFrame of features, X, with all of the columns except category_desc.
  • Create a DataFrame of labels, y from the category_desc column.
  • Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets
  • Print the labels and counts in y_train using .value_counts().

Interaktive Übung

Versuche dich an dieser Übung, indem du diesen Beispielcode vervollständigst.

# Create a DataFrame with all columns except category_desc
X = volunteer.____(____, axis=____)

# Create a category_desc labels dataset
y = ____[[____]]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = ____(____, ____, ____, random_state=42)

# Print the category_desc counts from y_train
print(____[____].____)
Code bearbeiten und ausführen