1. Adjusting your algorithm weights
In this video, you'll learn how to adjust your model parameters to optimize for fraud detection.
2. Balanced weights
When training a model for fraud detection, you want to try different options and settings to get the best precision-recall tradeoff possible. In scikit-learn there are two simple options to tweak your model for heavily imbalanced fraud data: the balanced mode and the balanced_subsample mode, which you can assign to the class_weight argument when defining the model.
The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data. The balanced_subsample mode is the same as the balanced option, except that weights are calculated again at each iteration of growing a tree in the random forest. This latter option is therefore only applicable for the random forest model.
The balanced option is, however, also available for many other classifiers: logistic regression has it, for example, as does the SVM model.
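To make this concrete, here is a minimal sketch of how you might pass these options in scikit-learn; the hyperparameter values shown are just placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Reweight classes inversely proportional to their frequencies in y
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=0)

# Recompute the weights for each bootstrap sample while growing each tree;
# this option only exists for tree ensembles such as the random forest
rf_subsample = RandomForestClassifier(class_weight='balanced_subsample', random_state=0)

# The balanced option is also available for other classifiers
lr = LogisticRegression(class_weight='balanced')
svm = SVC(class_weight='balanced', kernel='linear')
```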
3. Hyperparameter tuning for fraud detection
The class_weight argument also takes a manual input. This allows you to adjust the weights not based on the class frequencies in the sample, but to whatever ratio you like. So if you just want to upweight your minority class slightly, this is a good option. All the classifiers that have the class_weight option available should accept this manual setting as well.
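As a sketch, a manual ratio passed as a dictionary might look like this; the 1:15 ratio is purely illustrative, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier

# Manually upweight the minority (fraud) class: class 0 keeps weight 1,
# class 1 counts 15 times as heavily when fitting the model
model = RandomForestClassifier(class_weight={0: 1, 1: 15}, random_state=0)
```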
Moreover, the random forest takes many other options you can use to optimize the model; this is called hyperparameter tuning.
You can, for example, change the shape and size of the trees in the random forest by adjusting leaf size and tree depth. Some of the most important settings are the number of trees in the forest, set with n_estimators, and the number of features considered for splitting at each node, set with max_features.
Moreover, you can change the criterion used to split the data at each node; the default is the Gini impurity. Without going into too much detail now, I encourage you to research the different options when working with this model.
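For illustration, here is a sketch of a random forest with some of these hyperparameters set explicitly; the specific values are placeholders, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=300,                     # number of trees in the forest
    max_features='sqrt',                  # features considered at each split
    max_depth=10,                         # maximum depth of each tree
    min_samples_leaf=5,                   # minimum samples required in a leaf
    criterion='gini',                     # split criterion (alternative: 'entropy')
    class_weight='balanced_subsample',    # reweight classes per bootstrap sample
    random_state=0,
)
```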
4. Using GridSearchCV
A smarter way of hyperparameter tuning your model is to use GridSearchCV. You should have come across this in the course on Supervised Learning.
Let's import the package first.
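In current scikit-learn versions, that import looks like this:

```python
from sklearn.model_selection import GridSearchCV
```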
GridSearchCV evaluates all combinations of parameters we define in the parameter grid. This is an example of a parameter grid specifically for a random forest model.
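A parameter grid for a random forest could look like the following sketch; the exact values to search over are up to you.

```python
# Each key is a RandomForestClassifier parameter; GridSearchCV will try
# every combination of the listed values
param_grid = {
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [5, 10, None],
    'criterion': ['gini', 'entropy'],
}
```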
Let's define the machine learning model we'll use, and then put it into a grid search. You pass in the model, the parameter grid, and the number of cross-validation folds. Most importantly, you need to define a scoring metric to evaluate the models on. This is incredibly important in fraud detection. The default here is accuracy, so if you don't define this, your models are ranked based on accuracy, which you already know is useless. You therefore need to pass precision, recall, or F1 here. Let's go with F1 for this example.
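Putting this together, a sketch might look as follows; it assumes the param_grid defined above and training data X_train and y_train already exist.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier(random_state=5)

# Evaluate every combination in param_grid with 5-fold cross-validation,
# ranking models on F1 instead of the default accuracy
CV_model = GridSearchCV(estimator=model, param_grid=param_grid,
                        cv=5, scoring='f1', n_jobs=-1)
CV_model.fit(X_train, y_train)
```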
5. Finding the best model with GridSearchCV
Once you have fitted your GridSearchCV and model to the data, you can obtain the parameters belonging to the optimal model from the best_params_ attribute. Mind you, GridSearchCV is computationally very heavy to run. Depending on the size of your data and the number of parameters in the grid, this can take many hours to complete, so make sure to save the results.
You can easily get the results for the best_estimator_ and the best_score_; these results are all stored after fitting. The best score is the mean cross-validated score of the best_estimator_, which of course also depends on the scoring option you gave earlier. As you chose F1 before, you'll get the best F1 score here.
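In code, retrieving these stored results looks roughly like this, using the CV_model name from the fitted grid search sketched earlier:

```python
# Parameters of the best model found by the grid search
print(CV_model.best_params_)

# Mean cross-validated F1 score of the best estimator
print(CV_model.best_score_)

# The refitted best model itself, ready to make predictions
best_model = CV_model.best_estimator_
```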
6. Let's practice!
Let's practice!