Logistic regression and the ROC curve

1. Logistic regression and the ROC curve

It's time to introduce another model: logistic regression.

2. Logistic regression for binary classification

Despite its name, logistic regression is used for classification. This model calculates the probability, p, that an observation belongs to a binary class. Using our diabetes dataset as an example, if p is greater than or equal to zero-point-five, we label the data as one, representing a prediction that an individual is more likely to have diabetes; if p is less than zero-point-five, we label it zero, representing that they are more likely not to have diabetes.
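This decision rule can be sketched in a few lines. The probabilities below are hypothetical, made up purely to illustrate the thresholding step:

```python
import numpy as np

# Hypothetical predicted probabilities for five individuals
p = np.array([0.12, 0.47, 0.50, 0.83, 0.95])

# Default rule: label 1 if p >= 0.5, else 0
labels = (p >= 0.5).astype(int)
print(labels)  # [0 0 1 1 1]
```

Note that a probability of exactly zero-point-five falls on the positive side of the rule.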

3. Linear decision boundary

Note that logistic regression produces a linear decision boundary, as we can see in this image.

4. Logistic regression in scikit-learn

Using logistic regression in scikit-learn follows the same approach as used for other models. We first import LogisticRegression from sklearn-dot-linear_model. Next we instantiate the classifier, split our data, fit the model on our training data, and predict on our test set. In this video we use the churn dataset.
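The steps above can be sketched as follows. Since the churn dataset from the video is not included here, a synthetic dataset generated with make_classification stands in for it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset
X, y = make_classification(n_samples=1000, n_features=10,
                           random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Instantiate the classifier, fit on training data, predict on the test set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(y_pred[:5])
```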

5. Predicting probabilities

We can predict the probability that each instance belongs to a class by calling logistic regression's predict_proba method and passing the test features. This returns a 2-dimensional array with probabilities for both classes; in this case, that the individual did not churn, or did churn, respectively. We slice the second column, representing the positive-class probabilities, and store the results as y_pred_probs. Here we see the model predicts a probability of point-zero-eight-nine that the first observation will churn.
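A minimal sketch of this step, again using a synthetic stand-in for the churn data (the exact probability values will differ from the video's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; slice the second column
# (index 1) to keep only the positive-class probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]
print(y_pred_probs[0])
```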

6. Probability thresholds

The default probability threshold for logistic regression in scikit-learn is zero-point-five. This threshold can also apply to other models such as KNN. So what happens as we vary this threshold?
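One way to see the effect of varying the threshold is to apply several cut-offs to the same predicted probabilities; this sketch (synthetic data, thresholds chosen arbitrarily) shows that raising the threshold shrinks the number of positive predictions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)
probs = logreg.predict_proba(X_test)[:, 1]

# Apply three different thresholds to the same probabilities
counts = []
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    counts.append(int(preds.sum()))
    print(f"threshold={threshold}: {counts[-1]} predicted positives")
```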

7. The ROC curve

We can use a receiver operating characteristic, or ROC curve, to visualize how different thresholds affect true positive and false positive rates. Here, the dotted line represents a chance model, which randomly guesses labels.

8. The ROC curve

When the threshold equals zero, the model predicts one for all observations, meaning it will correctly predict all positive values, and incorrectly predict all negative values.

9. The ROC curve

If the threshold equals one, the model predicts zero for all data, which means that both true and false positive rates

10. The ROC curve

are zero. If we

11. The ROC curve

vary the threshold, we get a series of different false positive and true positive rates.

12. The ROC curve

A line plot of the thresholds helps to visualize the trend.

13. Plotting the ROC curve

To plot the ROC curve, we import roc_curve from sklearn-dot-metrics. We then call the function roc_curve; we pass the test labels as the first argument, and the predicted probabilities as the second. We unpack the results into three variables: false positive rate, FPR; true positive rate, TPR; and the thresholds. We can then plot a dotted line from zero to one, along with the FPR and TPR;
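Putting those steps together looks like this; the model and data are the same synthetic stand-in as before, and a non-interactive matplotlib backend is set so the script also runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

# Unpack false positive rates, true positive rates, and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], "k--")  # dotted line: the chance model
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()
```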

14. Plotting the ROC curve

to produce a figure such as this. This looks great, but how do we quantify the model's performance based on this plot?

15. ROC AUC

A model with a true positive rate of one and a false positive rate of zero would be the perfect model. Therefore, we calculate the area under the ROC curve, a metric known as AUC. Scores range from zero to one, with one being ideal. Here, the model scores point-six-seven, which is only 34% better than a model making random guesses, which would score point-five.

16. ROC AUC in scikit-learn

We can calculate AUC in scikit-learn by importing roc_auc_score from sklearn-dot-metrics. We call roc_auc_score, passing our test labels and our predicted probabilities, calculated by using the model's predict_proba method on X_test. As expected, we get a score of zero-point-six-seven.
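The same calculation on the synthetic stand-in data (the resulting score will differ from the video's point-six-seven):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression().fit(X_train, y_train)

# roc_auc_score takes the test labels and the positive-class probabilities
auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
print(auc)
```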

17. Let's practice!

Now let's build a logistic regression model and evaluate its performance!