CommencerCommencer gratuitement

Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you wanted to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

Cet exercice fait partie du cours

Preprocessing for Machine Learning in Python

Afficher le cours

Instructions

  • Create a DataFrame of features, X, with all of the columns except category_desc.
  • Create a DataFrame of labels, y from the category_desc column.
  • Split X and y into training and test sets, ensuring that the class distribution in the labels is the same in both sets
  • Print the labels and counts in y_train using .value_counts().

Exercice interactif pratique

Essayez cet exercice en complétant cet exemple de code.

# Create a DataFrame with all columns except category_desc
X = volunteer.____(____, axis=____)

# Create a category_desc labels dataset
y = ____[[____]]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = ____(____, ____, ____, random_state=42)

# Print the category_desc counts from y_train
print(____[____].____)
Modifier et exécuter le code