K-fold cross-validation
You will start by getting hands-on experience with K-fold cross-validation, the most commonly used validation strategy.
The data you'll be working with is from the "Two Sigma Connect: Rental Listing Inquiries" Kaggle competition. The competition problem is a multi-class classification of rental listings into 3 classes: low interest, medium interest and high interest. For faster performance, you will work with a subsample consisting of 1,000 observations.
You need to implement a K-fold validation strategy and look at the sizes of each fold obtained. The `train` DataFrame is already available in your workspace.
This exercise is part of the course
Winning a Kaggle Competition in Python
Exercise instructions
- Create a `KFold` object with 3 folds.
- Loop over each split using the `kf` object.
- For each split select training and testing folds using `train_index` and `test_index`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import KFold
from sklearn.model_selection import KFold
# Create a KFold object
kf = ____(n_splits=____, shuffle=True, random_state=123)
# Loop through each split
fold = 0
for train_index, test_index in ____.____(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[____], train.iloc[____]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
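If you want to check your answer, here is one possible completed version of the exercise. Since the competition's `train` DataFrame is not available outside the course workspace, this sketch substitutes a small hypothetical stand-in with an `interest_level` column; with the real 1,000-row subsample the fold sizes would differ accordingly.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical 9-row stand-in for the competition's `train` DataFrame
train = pd.DataFrame({
    'interest_level': ['low', 'medium', 'high'] * 3,
    'price': [1500, 2300, 3100, 1800, 2700, 900, 2100, 1600, 2500],
})

# Create a KFold object with 3 folds
kf = KFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds by positional index
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
```

With 9 rows and 3 folds, each CV training fold holds 6 observations and each testing fold holds 3; `shuffle=True` with a fixed `random_state` makes the split reproducible.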