Partitioning

In order to properly evaluate a model, one can partition the data in a train and test set. The train set contains the data the model is built on, and the test data is used to evaluate the model. This division is done randomly, but when the target incidence is low, it could be necessary to stratify, that is, to make sure that the train and test data contain an equal percentage of targets.

In this exercise you will partition the data with stratification and verify that the train and test data have equal target incidence. The train_test_split method has already been imported, and the X and y DataFrames are available in your workspace.

This exercise is part of the course

Introduction to Predictive Analytics in Python

View Course

Exercise instructions

  • Stratify these DataFrames using the train_test_split method. Make sure that train and test set are the same size, and have equal target incidence.
  • Calculate the target incidence of the train set. This is the number of targets in the train set divided by the number of observations in the train set.
  • Calculate the target incidence of the test set.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the partitioning module
from sklearn.model_selection import train_test_split

# Create DataFrames with variables and target
X = basetable.drop("target", 1)
y = basetable["target"]

# Carry out 50-50 partititioning with stratification
X_train, X_test, y_train, y_test = ____(X, y, test_size = ____, stratify = ____)

# Create the final train and test basetables
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# Check whether train and test have same percentage targets
print(round(sum(train[____])/len(____), 2))
print(round(sum(test[____])/len(____), 2))