GridSearchCV to find optimal parameters
In this exercise you're going to tune your model in a less "random" way, by letting GridSearchCV do the work for you.
With GridSearchCV you can define which performance metric to score the candidate settings on. Since in fraud detection we are mostly interested in catching as many fraud cases as possible, you can optimize your model settings to get the best possible recall score. If you also cared about reducing the number of false positives, you could optimize on the F1-score instead, which gives you the precision-recall trade-off.
GridSearchCV has already been imported from sklearn.model_selection, so let's give it a try!
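As a quick illustration (a minimal sketch, separate from the exercise code below), the scoring metric is passed to GridSearchCV as a string, so switching the optimization target between recall and F1 is a one-word change:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Optimize for recall: catch as many fraud cases as possible
grid_recall = GridSearchCV(RandomForestClassifier(random_state=5),
                           param_grid={'n_estimators': [1, 30]},
                           scoring='recall', cv=5)

# Optimize for F1 instead: trade some recall for fewer false positives
grid_f1 = GridSearchCV(RandomForestClassifier(random_state=5),
                       param_grid={'n_estimators': [1, 30]},
                       scoring='f1', cv=5)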
This exercise is part of the course Fraud Detection in Python.
Exercise instructions
- Define in the parameter grid that you want to try 1 and 30 trees, and that you want to try both the gini and entropy split criteria.
- Define the model to be a simple RandomForestClassifier; keep random_state at 5 to be able to compare models.
- Set the scoring option such that it optimizes for recall.
- Fit the model to the training data X_train and y_train, and obtain the best parameters for the model.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define the parameter sets to test
param_grid = {'n_estimators': [____, ____],
              'max_features': ['auto', 'log2'],
              'max_depth': [4, 8],
              'criterion': ['____', '____']}
# Define the model to use
model = ____(random_state=5)
# Combine the parameter sets with the defined model
CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='____', n_jobs=-1)
# Fit the model to our training data and obtain best parameters
CV_model.fit(____, ____)
CV_model.____
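For reference, below is one possible completion, a sketch based on the instructions above rather than the official solution. It generates a small imbalanced dataset with make_classification to stand in for the course's X_train and y_train, and it uses 'sqrt' in place of 'auto' for max_features, since 'auto' was removed for RandomForestClassifier in scikit-learn 1.3.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Stand-in for the course's training data: a small, imbalanced dataset
X_train, y_train = make_classification(n_samples=500, weights=[0.95], random_state=0)

# Define the parameter sets to test: 1 vs 30 trees, both split criteria
param_grid = {'n_estimators': [1, 30],
              'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
              'max_depth': [4, 8],
              'criterion': ['gini', 'entropy']}

# Define the model; a fixed random_state keeps models comparable
model = RandomForestClassifier(random_state=5)

# Combine the parameter grid with the model, optimizing for recall
CV_model = GridSearchCV(estimator=model, param_grid=param_grid,
                        cv=5, scoring='recall', n_jobs=-1)

# Fit to the training data and obtain the best parameters
CV_model.fit(X_train, y_train)
print(CV_model.best_params_)

After fitting, CV_model.best_params_ holds the winning combination, and since GridSearchCV refits on the full training data by default, CV_model.best_estimator_ is a ready-to-use model for predictions.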