1. Model evaluation: imbalanced classification models
Welcome back! I'm glad to see you!
2. Class imbalance
A class imbalance problem arises when the ML model has a categorical target variable whose classes are not equally represented. Most ML algorithms work best when there is an approximately equal number of observations in each class. When there is a large difference between the number of observations in each class, the results can be misleading, especially because most algorithms are designed to reduce error and maximize accuracy.
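In practice, an imbalance often shows up as soon as you check the target's class proportions. Here is a minimal sketch with hypothetical counts rather than the course's loan data:

```python
import pandas as pd

# Hypothetical target column: 95% negative class, 5% positive class
y = pd.Series([0] * 950 + [1] * 50)
print(y.value_counts(normalize=True))  # 0    0.95
                                       # 1    0.05
```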
3. Confusion matrix
A confusion matrix shows the number of correctly and incorrectly classified observations in each class.
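As a rough sketch of what that looks like in code, here is the confusion_matrix function from sklearn.metrics applied to small hypothetical label arrays (not the loan data):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]  # actual classes (1 = positive, minority)
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # model predictions

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[7 0]
                                         #  [2 1]]
```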
4. Performance metrics
Accuracy measures how often the model's classifications are correct overall and is calculated as the sum of the true negatives and true positives divided by the total number of observations. However, when evaluating a dataset with imbalanced classes, accuracy is not the best metric to use. So what should you use instead? A closer look at the confusion matrix is insightful and can be used to calculate better metrics in the case of imbalanced classes.
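To see why, here is a small worked example with hypothetical counts: a model that predicts the negative class for everything still scores very high accuracy.

```python
# 950 negatives, 50 positives; the model never predicts the positive class
tn, fp, fn, tp = 950, 0, 50, 0

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)  # 0.95 -- looks strong, yet every positive case was missed
```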
5. Metrics from the matrix
Precision, calculated as the number of true positives divided by the sum of true positives and false positives, measures how often the model is correct when it predicts the positive class.
Low precision indicates a high number of false positives.
Recall, also called sensitivity, measures how often the model predicts positive when an observation actually is positive. It is calculated as the number of true positives divided by the sum of true positives and false negatives, which together are all of the positive observations in the data. It is also known as the true positive rate.
Low recall indicates a high number of false negatives.
The F1 score combines precision and recall into a single number by taking their harmonic mean. It is calculated as two times the product of precision and recall divided by the sum of precision and recall.
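Here is a minimal sketch of computing all three metrics with scikit-learn, reusing the hypothetical label arrays from the confusion matrix example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 1 / 1 = 1.0
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 1 / 3 ≈ 0.33
print(f1_score(y_true, y_pred))         # 2 * P * R / (P + R) = 0.5
```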
6. Resampling techniques
Resampling is a technique that tries to create more balance between the classes. It either creates more observations from the minority class, which is called oversampling, or keeps only a subset of the majority class, which is called undersampling.
Always split into train and test sets BEFORE trying oversampling techniques! Oversampling before splitting the data can put exact copies of the same observations in both the train and test sets. That lets the model simply memorize specific data points, causing overfitting and poor generalization to the test data, which we're always trying to avoid.
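Here is a minimal sketch of upsampling the minority class with resample, splitting first; the toy DataFrame and its column names are hypothetical stand-ins for the loan data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data: 16 majority-class rows (0) and 4 minority-class rows (1)
df = pd.DataFrame({"feature": range(20),
                   "target": [0] * 16 + [1] * 4})

# Split BEFORE resampling so duplicated rows can't leak into the test set
train, test = train_test_split(df, test_size=0.25,
                               stratify=df["target"], random_state=42)

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Oversample the minority class with replacement to match the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["target"].value_counts())
```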
7. Functions
Some of the functions you'll encounter in the following exercises are LogisticRegression from sklearn.linear_model. From sklearn.metrics, you'll use confusion_matrix, precision_score, recall_score, and f1_score, which return their respective performance metrics. You'll also try out the resample function from sklearn.utils, which draws a random sample from its first argument with as many rows as specified by the n_samples argument, to practice both upsampling and downsampling.
These techniques will demonstrate how to handle the class imbalance in the loan dataset you've become so familiar with throughout this course.
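As a preview, here is one way those pieces could fit together end to end; the synthetic data below stands in for the loan dataset, and the exercise code may differ:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

# Upsample the minority class in the training set only
train = pd.DataFrame(X_train)
train["target"] = y_train
majority = train[train["target"] == 0]
minority = resample(train[train["target"] == 1], replace=True,
                    n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority])

# Fit on the balanced training data, evaluate on the untouched test set
model = LogisticRegression(max_iter=1000)
model.fit(balanced.drop(columns="target").values, balanced["target"])
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(f1_score(y_test, y_pred))
```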
8. Let's practice!
Alright, let's go find some balance, class balance that is!