Stratified K-fold

As you've just noticed, you have a pretty different target variable distribution among the folds due to the random splits. It's not crucial for this particular competition, but could be an issue for the classification competitions with the highly imbalanced target variable.

To overcome this, let's implement the stratified K-fold strategy with the stratification on the target variable. train DataFrame is already available in your workspace.

Bu egzersiz

Winning a Kaggle Competition in Python

kursunun bir parçasıdır

Kursu Görüntüle

Egzersiz talimatları

Create a StratifiedKFold object with 3 folds and shuffling.
Loop over each split using str_kf object. Stratification is based on the "interest_level" column.
For each split select training and testing folds using train_index and test_index.

Uygulamalı interaktif egzersiz

Bu örnek kodu tamamlayarak bu egzersizi bitirin.

# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = ____(n_splits=____, shuffle=____, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in ____.____(train, train['interest_level']):
    # Obtain training and testing folds
    cv_train, cv_test = ____.iloc[____], ____.iloc[____]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1

Kodu Düzenle ve Çalıştır