Multiple logistic regression

1. Multiple logistic regression

In this last chapter, let's switch from linear regression to logistic regression.

2. Bank churn dataset

We'll revisit the three-column bank churn dataset from the previous course. has_churned is the response, denoting whether or not the customer churned. time_since_first_purchase measures the length of the relationship with the customer, and time_since_last_purchase measures how recently the customer was active. The explanatory variables have been transformed to protect commercially sensitive information.

3. logit()

Recall that to run a logistic regression in statsmodels, you use the logit function instead of ols. To extend logistic regression to multiple explanatory variables, you change the formula in the same way as for linear regression: join the explanatory variables with a plus to leave out interactions, or with a times to include them. There's no new syntax here.
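To make the two formula styles concrete, here is a minimal sketch. The churn DataFrame below is synthetic stand-in data generated with NumPy, and the coefficients used to generate it are assumptions chosen only so the models have something to fit; it is not the course dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the churn data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
churn = pd.DataFrame({
    "time_since_first_purchase": rng.normal(size=n),
    "time_since_last_purchase": rng.normal(size=n),
})
# Assumed relationship, used only to generate a 0/1 response with some signal
log_odds = (-0.3 * churn["time_since_first_purchase"]
            + 0.8 * churn["time_since_last_purchase"]).to_numpy()
churn["has_churned"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# A plus between the explanatory variables: no interaction term
mdl_no_int = logit(
    "has_churned ~ time_since_first_purchase + time_since_last_purchase",
    data=churn,
).fit(disp=0)

# A times between the explanatory variables: main effects plus interaction
mdl_int = logit(
    "has_churned ~ time_since_first_purchase * time_since_last_purchase",
    data=churn,
).fit(disp=0)

print(mdl_no_int.params)
print(mdl_int.params)
```

The model with the times formula has one extra coefficient, the interaction between the two explanatory variables.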

4. The four outcomes

Recall that when the response variable has two possible values, there are four possible outcomes for each prediction. The model is either correct, giving a true positive or a true negative, or wrong, giving a false positive or a false negative. We can quantify and visualize these four outcomes using a confusion matrix. The code for computing the confusion matrix is shown here, as we saw in the previous course. The confusion matrix lets you calculate metrics like model accuracy, sensitivity, and specificity. You will do this in the exercises.
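As a rough sketch of how the four outcomes turn into metrics, the example below uses the pred_table method of a fitted statsmodels logit model, where rows are actual values and columns are predicted values. The data is again a synthetic stand-in (an assumption, not the course dataset), so the numbers themselves carry no meaning.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the churn data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
churn = pd.DataFrame({
    "time_since_first_purchase": rng.normal(size=n),
    "time_since_last_purchase": rng.normal(size=n),
})
log_odds = (-0.3 * churn["time_since_first_purchase"]
            + 0.8 * churn["time_since_last_purchase"]).to_numpy()
churn["has_churned"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

mdl = logit(
    "has_churned ~ time_since_first_purchase * time_since_last_purchase",
    data=churn,
).fit(disp=0)

# Rows are actual values (0, 1); columns are predicted values (0, 1)
conf_matrix = mdl.pred_table()
TN, FP = conf_matrix[0]  # actual negatives
FN, TP = conf_matrix[1]  # actual positives

accuracy = (TN + TP) / conf_matrix.sum()  # share of correct predictions
sensitivity = TP / (TP + FN)              # share of actual churners caught
specificity = TN / (TN + FP)              # share of non-churners identified

print(conf_matrix)
print(accuracy, sensitivity, specificity)
```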

5. Prediction flow

The prediction flow of multiple logistic regression should also feel familiar, since you've seen all the techniques already. Use itertools' product function to create every combination of explanatory variable values, store the combinations in a DataFrame, then assign a new column of predictions from the model.
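The flow just described can be sketched as follows. The data is a synthetic stand-in and the grid ranges are arbitrary assumptions; in practice you would pick ranges that cover your own explanatory variables.

```python
from itertools import product

import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the churn data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
churn = pd.DataFrame({
    "time_since_first_purchase": rng.normal(size=n),
    "time_since_last_purchase": rng.normal(size=n),
})
log_odds = (-0.3 * churn["time_since_first_purchase"]
            + 0.8 * churn["time_since_last_purchase"]).to_numpy()
churn["has_churned"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

mdl = logit(
    "has_churned ~ time_since_first_purchase * time_since_last_purchase",
    data=churn,
).fit(disp=0)

# Grid values for each explanatory variable (ranges are assumptions)
first_purchase = np.arange(-2, 4.5, 0.5)
last_purchase = np.arange(-1, 6.5, 0.5)

# Every combination of the two sets of values, one row per pair
explanatory_data = pd.DataFrame(
    list(product(first_purchase, last_purchase)),
    columns=["time_since_first_purchase", "time_since_last_purchase"],
)

# Add the predicted churn probability for each combination
prediction_data = explanatory_data.assign(
    has_churned=mdl.predict(explanatory_data)
)

print(prediction_data.head())
```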

6. Visualization

For visualization purposes, I also create a column with most likely outcomes. It holds the churn predictions rounded to zero or one: if the probability of churning is less than 0.5, the most likely outcome is that they won't churn. If their probability is greater than 0.5, it's more likely that they will churn. Then, two scatter plots are drawn: one for the actual churn data, and one for the prediction data, colored by most likely outcome. You'll see what it looks like in the exercises.
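One possible version of this plot is sketched below using plain matplotlib, which is an assumption; the exercises may use a different plotting library. The data is a synthetic stand-in, and the grid ranges and styling are arbitrary choices.

```python
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the churn data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 400
churn = pd.DataFrame({
    "time_since_first_purchase": rng.normal(size=n),
    "time_since_last_purchase": rng.normal(size=n),
})
log_odds = (-0.3 * churn["time_since_first_purchase"]
            + 0.8 * churn["time_since_last_purchase"]).to_numpy()
churn["has_churned"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

mdl = logit(
    "has_churned ~ time_since_first_purchase * time_since_last_purchase",
    data=churn,
).fit(disp=0)

# Prediction grid (ranges are assumptions)
explanatory_data = pd.DataFrame(
    list(product(np.arange(-3, 3.25, 0.25), np.arange(-3, 3.25, 0.25))),
    columns=["time_since_first_purchase", "time_since_last_purchase"],
)
prediction_data = explanatory_data.assign(
    has_churned=mdl.predict(explanatory_data)
)

# Round each predicted probability to the most likely outcome, 0 or 1
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

fig, ax = plt.subplots()
# Actual data, colored by the observed response
ax.scatter(
    churn["time_since_first_purchase"], churn["time_since_last_purchase"],
    c=churn["has_churned"], cmap="coolwarm", edgecolor="black",
)
# Prediction grid, colored by most likely outcome, drawn faintly behind
ax.scatter(
    prediction_data["time_since_first_purchase"],
    prediction_data["time_since_last_purchase"],
    c=prediction_data["most_likely_outcome"], cmap="coolwarm",
    alpha=0.2, marker="s", zorder=0,
)
ax.set_xlabel("time_since_first_purchase")
ax.set_ylabel("time_since_last_purchase")
# Call plt.show() to display the figure in an interactive session
```

The faint squares show where the model switches from predicting no churn to predicting churn, and the outlined points show how the actual customers fall on either side of that boundary.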

7. Let's practice!

So go ahead and get started!