K-fold cross-validation
You will work with a binary classification problem on a subsample from Kaggle playground competition. The objective of this competition is to predict whether a famous basketball player Kobe Bryant scored a basket or missed a particular shot.
Train data is available in your workspace as bryant_shots
DataFrame. It contains data on 10,000 shots with its properties and a target variable "shot\_made\_flag"
-- whether shot was scored or not.
One of the features in the data is "game_id"
-- a particular game where the shot was made. There are 541 distinct games. So, you deal with a high-cardinality categorical feature. Let's encode it using a target mean!
Suppose you're using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation.
This exercise is part of the course
Winning a Kaggle Competition in Python
Exercise instructions
- To achieve this, you need to repeat encoding procedure for the
"game_id"
categorical feature inside each folds split separately. Your goal is to specify all the missing parameters for themean_target_encoding()
function call inside each folds split. - Recall that the
train
andtest
parameters expect the train and test DataFrames. - While the
target
andcategorical
parameters expect names of the target variable and categorical feature to be encoded.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)
# For each folds split
for train_index, test_index in kf.split(bryant_shots):
cv_train, cv_test = bryant_shots.iloc[train_index], bryant_shots.iloc[test_index]
# Create mean target encoded feature
cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
test=____,
target='shot_made_flag',
categorical='____',
alpha=5)
# Look at the encoding
print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))