1. Confusion matrix
A really useful tool for evaluating binary classification models is the "confusion matrix". This is a table of the model's predicted classes versus the classes actually observed in reality.
It's called a confusion matrix because it reveals how "confused" the model is between the two classes, highlighting instances in which one class is mistaken for the other.
2. Confusion matrix
The columns of the confusion matrix are the true classes, while the rows are the predicted classes. From left to right, top to bottom, the cells of the matrix are: true positives, false positives, false negatives, and true negatives.
The main diagonal of the confusion matrix contains the cases where the model is correct (true positives and true negatives), while the off-diagonal contains the cases where the model is incorrect (false positives and false negatives).
Let's briefly review the 4 possible outcomes with a binary classification model: true positives are cases where the model correctly predicted yes. False positives are cases where the model incorrectly predicted yes. False negatives are cases where the model incorrectly predicted no. And true negatives are cases where the model correctly predicted no.
All 4 of these outcomes are important when evaluating a predictive model's accuracy, so it's useful to look at them simultaneously in a single table.
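As a quick illustration, here is the layout described above, sketched as a labeled 2-by-2 matrix in R (the "yes"/"no" labels are just placeholders for the two classes):

```r
# Illustrative layout only: rows are predicted classes, columns are true classes
matrix(c("TP", "FN", "FP", "TN"), nrow = 2,
       dimnames = list(predicted = c("yes", "no"),
                       actual    = c("yes", "no")))
```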
3. Confusion matrix
To generate a confusion matrix, we start by fitting a model to our training set. In this case, we'll use a simple logistic regression model.
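A minimal sketch of this step, assuming the Sonar data from the mlbench package and a 60/40 train/test split (the split proportions and seed here are illustrative, not prescribed):

```r
library(mlbench)
data(Sonar)

# Make "M" (mine) the second factor level so glm() models P(mine)
Sonar$Class <- relevel(Sonar$Class, ref = "R")

# Hypothetical 60/40 train/test split
set.seed(42)
rows  <- sample(nrow(Sonar))
split <- round(length(rows) * 0.60)
train <- Sonar[rows[1:split], ]
test  <- Sonar[rows[-(1:split)], ]

# Fit a logistic regression model on the training set
# (with this many predictors, glm() may warn about separation;
# that's fine for illustration)
model <- glm(Class ~ ., family = "binomial", data = train)
```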
Next, we predict on the test set and apply a threshold to the predicted probabilities to get class assignments.
In other words, the logistic regression model outputs the probability that an object is a mine, but we need to use these probabilities to make a binary decision: rock or mine.
In the simplest case, we use a probability of 50% as our cutoff, assigning anything at or below 50% as a rock and anything above 50% as a mine.
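Continuing the sketch above, we get predicted probabilities on the test set and cut them at 50%:

```r
# Predicted probability that each test-set object is a mine
p <- predict(model, test, type = "response")

# Apply a 50% threshold: above 0.50 is classified as a mine ("M"),
# everything else as a rock ("R")
p_class <- ifelse(p > 0.50, "M", "R")
```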
4. Confusion matrix
Next, we make a 2-way frequency table using the "table" function in R. This table reveals a high number of false positives and false negatives: our model is frequently wrong.
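With the class assignments from the sketch above, the 2-way frequency table is a single call (row and column order follows the factor levels):

```r
# Rows: predicted classes; columns: true classes
table(predicted = p_class, actual = test$Class)
```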
5. Confusion matrix
Rather than calculate our error rate by hand, we'll now let the "confusionMatrix" function in caret do it for us. This function provides the same 2-way frequency table as the table function in base R, but outputs a number of useful statistics as well.
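A sketch of the call, again using the hypothetical p_class and test objects from above; confusionMatrix() expects factors with matching levels, and we name "M" as the positive class explicitly:

```r
library(caret)

# Same 2-way table as before, plus accuracy, no information rate,
# sensitivity, specificity, and more
confusionMatrix(
  data      = factor(p_class, levels = levels(test$Class)),
  reference = test$Class,
  positive  = "M"
)
```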
The most useful statistic in this table is the accuracy, which is not very impressive.
Compare this to the "no information rate", which is the accuracy we'd get by always predicting the dominant class, in this case mines. At about 50%, the no information rate reveals that a dummy model that always predicts mines would be more accurate than our logistic regression!
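If you want to verify the no information rate by hand, it is just the proportion of the most common class in the test set:

```r
# No information rate: accuracy of always predicting the dominant class
max(prop.table(table(test$Class)))
```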
6. Let's practice!
Let's practice calculating confusion matrices.