1. Logistic regression and the ROC curve
It's time to introduce another model: logistic regression.
2. Logistic regression for binary classification
Despite its name, logistic regression is used for classification.
This model calculates the probability, p, that an observation belongs to a binary class.
Using our diabetes dataset as an example: if p is greater than or equal to zero-point-five, we label the data as one, representing a prediction that the individual is more likely to have diabetes; if p is less than zero-point-five, we label it zero, representing that they are more likely not to have diabetes.
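As a tiny sketch of this labeling rule (the probability values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities for five individuals
p = np.array([0.12, 0.47, 0.50, 0.73, 0.91])

# Label as 1 (more likely to have diabetes) when p >= 0.5, otherwise 0
labels = (p >= 0.5).astype(int)
print(labels)  # [0 0 1 1 1]
```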
3. Linear decision boundary
Note that logistic regression produces a linear decision boundary, as we can see in this image.
4. Logistic regression in scikit-learn
Using logistic regression in scikit-learn follows the same approach as used for other models. We first import LogisticRegression from sklearn-dot-linear_model.
Next we instantiate the classifier, split our data, fit the model on our training data, and predict on our test set.
In this video we use the churn dataset.
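A minimal sketch of these steps might look like the following, assuming the churn data is already loaded in a DataFrame called churn_df with a binary "churn" target column (these names, and the split settings, are assumptions for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed: churn_df is a pandas DataFrame with a binary "churn" target column
X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split the data, fit the model on the training set, and predict on the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
```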
5. Predicting probabilities
We can predict the probabilities of each instance belonging to each class by calling logistic regression's predict_proba method and passing the test features. This returns a two-dimensional array with probabilities for both classes: in this case, that the individual did not churn, or did churn, respectively.
We slice the second column, representing the positive class probabilities, and store the results as y_pred_probs.
Here we see the model predicts a probability of point-zero-eight-nine that the first observation has churned.
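Continuing the sketch above, slicing the second column keeps only the positive-class probabilities:

```python
# Each row of predict_proba's output holds [P(did not churn), P(churned)];
# slice the second column to keep only the positive-class probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

# Predicted probability that the first test observation churned
print(y_pred_probs[0])
```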
6. Probability thresholds
The default probability threshold for logistic regression in scikit-learn is zero-point-five.
This threshold can also apply to other models such as KNN.
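As an aside, here is a sketch of how we might relabel predictions with a different threshold by hand, reusing the y_pred_probs array from before; the value zero-point-three is an arbitrary choice for illustration:

```python
# Apply a custom probability threshold instead of the default 0.5
threshold = 0.3  # arbitrary value for illustration
y_pred_custom = (y_pred_probs >= threshold).astype(int)
```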
So what happens as we vary this threshold?
7. The ROC curve
We can use a receiver operating characteristic, or ROC curve, to visualize how different thresholds affect true positive and false positive rates. Here, the dotted line represents a chance model, which randomly guesses labels.
8. The ROC curve
When the threshold equals zero, the model predicts one for all observations, meaning it will correctly predict all positive values, and incorrectly predict all negative values.
9. The ROC curve
If the threshold equals one, the model predicts zero for all data.
10. The ROC curve
This means that both true and false positive rates are zero.
11. The ROC curve
If we vary the threshold, we get a series of different false positive and true positive rates.
12. The ROC curve
A line plot of the thresholds helps to visualize the trend.
13. Plotting the ROC curve
To plot the ROC curve, we import roc_curve from sklearn-dot-metrics.
We then call the function roc_curve; we pass the test labels as the first argument, and the predicted probabilities as the second.
We unpack the results into three variables: false positive rate, FPR; true positive rate, TPR; and the thresholds.
We can then plot a dotted line from zero to one, along with the FPR and TPR.
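A sketch of these plotting steps, using matplotlib and the y_pred_probs array computed earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Unpack the false positive rates, true positive rates, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], "k--")  # dotted line representing the chance model
plt.plot(fpr, tpr)               # ROC curve for the logistic regression model
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()
```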
14. Plotting the ROC curve
This produces a figure such as the one shown. It looks great, but how do we quantify the model's performance based on this plot?
15. ROC AUC
A model with a true positive rate of one and a false positive rate of zero would be the perfect model.
Therefore, we calculate the area under the ROC curve, a metric known as AUC. Scores range from zero to one, with one being ideal. Here, the model scores point-six-seven, which is only 34% better than a model making random guesses.
16. ROC AUC in scikit-learn
We can calculate AUC in scikit-learn by importing roc_auc_score from sklearn-dot-metrics.
We call roc_auc_score, passing our test labels and our predicted probabilities, calculated by using the model's predict_proba method on X_test.
As expected, we get a score of zero-point-six-seven.
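Putting this together with the earlier sketch:

```python
from sklearn.metrics import roc_auc_score

# AUC computed from the test labels and the positive-class probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred_probs))
```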
17. Let's practice!
Now let's build a logistic regression model and evaluate its performance!