
Undersampling training data

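It's time to undersample the training set yourself with a few lines of pandas code. Undersampling here means randomly keeping only as many non-defaults (the majority class) as there are defaults, so that both classes are the same size. Once the undersampling is complete, you can check the value counts for loan_status to verify the results.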

X_y_train, count_nondefault, and count_default are already loaded in the workspace. They were created using the following code:

X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)
count_nondefault, count_default = X_y_train['loan_status'].value_counts()

The .value_counts() for the original training data will print automatically.
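
To see what the setup code produces, here is a minimal, self-contained sketch on toy data. The loan_amnt column and all values are hypothetical stand-ins for the course's X_train and y_train, and the unpacking assumes non-defaults are the majority class, since .value_counts() returns counts in descending order.

```python
import pandas as pd

# Toy stand-ins for the course's X_train and y_train (hypothetical values)
X_train = pd.DataFrame({"loan_amnt": [5000, 12000, 3000, 8000, 15000, 7000]},
                       index=[10, 42, 7, 3, 99, 21])
y_train = pd.Series([0, 0, 1, 0, 0, 1], index=X_train.index, name="loan_status")

# Reset both indexes so the rows line up by position when joined column-wise
X_y_train = pd.concat([X_train.reset_index(drop=True),
                       y_train.reset_index(drop=True)], axis=1)

# value_counts() sorts by count, descending, so the majority class comes first
counts = X_y_train["loan_status"].value_counts()
count_nondefault, count_default = counts
print(counts)
```

If defaults were ever the larger class, the two names would be swapped, so looking the counts up by label (0 and 1) would be the safer choice in that case.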

This exercise is part of the course

Credit Risk Modeling in Python

Exercise instructions

  • Split X_y_train into non-defaults (loan_status == 0) and defaults (loan_status == 1), stored as nondefaults and defaults.
  • Sample count_default rows from nondefaults and store the result as nondefaults_under.
  • Concatenate nondefaults_under and defaults using pd.concat() and store the result as X_y_train_under.
  • Print the .value_counts() of loan_status for the new data set.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create data sets for defaults and non-defaults
____ = ____[____[____] == 0]
____ = ____[____[____] == 1]

# Undersample the non-defaults
____ = nondefaults.sample(____)

# Concatenate the undersampled nondefaults with defaults
____ = pd.____([____.reset_index(drop = True),
                             ____.reset_index(drop = True)], axis = 0)

# Print the value counts for loan status
print(____[____].____())
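
One possible way to fill in the blanks, shown as a runnable sketch on toy data rather than the course's official solution. The toy X_y_train, the loan_amnt column, and the random_state argument are assumptions added here so the example is self-contained and reproducible.

```python
import pandas as pd

# Toy stand-in for X_y_train (hypothetical values): 6 non-defaults, 2 defaults
X_y_train = pd.DataFrame({
    "loan_amnt": [5000, 12000, 3000, 8000, 15000, 7000, 9000, 4000],
    "loan_status": [0, 0, 1, 0, 0, 1, 0, 0],
})
count_nondefault, count_default = X_y_train["loan_status"].value_counts()

# Create data sets for non-defaults and defaults
nondefaults = X_y_train[X_y_train["loan_status"] == 0]
defaults = X_y_train[X_y_train["loan_status"] == 1]

# Undersample the non-defaults down to the number of defaults
# (random_state is an addition here, used only to make the draw repeatable)
nondefaults_under = nondefaults.sample(count_default, random_state=0)

# Stack the undersampled non-defaults on top of the defaults (row-wise)
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop=True),
                             defaults.reset_index(drop=True)], axis=0)

# Print the value counts for loan status; both classes should now match
print(X_y_train_under["loan_status"].value_counts())
```

Because each piece has its index reset before the row-wise concat, the combined frame contains repeated index labels. Passing ignore_index=True to pd.concat, or calling .reset_index(drop=True) on the result, gives a clean index if later steps need one.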