Undersampling training data
It's time to undersample the training set yourself with a few lines of pandas code. Once the undersampling is complete, you can check the value counts for loan_status to verify the results.
X_y_train, count_nondefault, and count_default are already loaded in the workspace. They were created using the following code:
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)
count_nondefault, count_default = X_y_train['loan_status'].value_counts()
The .value_counts() for the original training data will print automatically.
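To see what the setup code above produces, here is a minimal sketch on a small made-up data set. The loan_amnt column and every value in it are hypothetical stand-ins, not the course data. Note that .value_counts() sorts counts in descending order by default, so the tuple unpacking yields count_nondefault first only because non-defaults are the majority class.

```python
import pandas as pd

# Hypothetical toy data: 8 non-defaults (0) and 2 defaults (1)
X_train = pd.DataFrame(
    {"loan_amnt": [1000, 2000, 3000, 4000, 5000,
                   6000, 7000, 8000, 9000, 10000]},
    index=range(10, 20),  # non-zero-based index, as after a train/test split
)
y_train = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
                    name="loan_status", index=X_train.index)

# Resetting both indexes lines the rows up by position before joining columns
X_y_train = pd.concat([X_train.reset_index(drop=True),
                       y_train.reset_index(drop=True)], axis=1)

# value_counts() sorts descending, so the majority class count comes first
count_nondefault, count_default = X_y_train["loan_status"].value_counts()
print(count_nondefault, count_default)
```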
This exercise is part of the course
Credit Risk Modeling in Python
Exercise instructions
- Create data sets of non-defaults and defaults stored as nondefaults and defaults.
- Sample the nondefaults to the same number as count_default and store it as nondefaults_under.
- Concatenate nondefaults_under and defaults using pd.concat() and store it as X_y_train_under.
- Print the .value_counts() of loan status for the new data set.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create data sets for defaults and non-defaults
____ = ____[____[____] == 0]
____ = ____[____[____] == 1]
# Undersample the non-defaults
____ = nondefaults.sample(____)
# Concatenate the undersampled nondefaults with defaults
____ = pd.____([____.reset_index(drop = True),
                ____.reset_index(drop = True)], axis = 0)
# Print the value counts for loan status
print(____[____].____())
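
For reference, the four steps can be sketched end to end on a small made-up data set. This is one possible way to fill in the blanks, not necessarily the course's official solution. The toy values and the loan_amnt column are hypothetical, and random_state is an added assumption so that the sample is reproducible.

```python
import pandas as pd

# Hypothetical toy training set: 8 non-defaults (0) and 2 defaults (1)
X_y_train = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000,
                  6000, 7000, 8000, 9000, 10000],
    "loan_status": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
count_nondefault, count_default = X_y_train["loan_status"].value_counts()

# Create data sets for defaults and non-defaults
nondefaults = X_y_train[X_y_train["loan_status"] == 0]
defaults = X_y_train[X_y_train["loan_status"] == 1]

# Undersample the non-defaults down to the number of defaults
# (random_state is an assumption added here for reproducibility)
nondefaults_under = nondefaults.sample(n=count_default, random_state=0)

# Stack the undersampled non-defaults on top of the defaults (row-wise)
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop=True),
                             defaults.reset_index(drop=True)], axis=0)

# Both classes should now have the same count
print(X_y_train_under["loan_status"].value_counts())
```

Because each piece has its index reset before the row-wise concatenation, the combined frame contains repeated index labels. If unique labels matter downstream, passing ignore_index=True to pd.concat() is one alternative.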