
Multiple logistic regression

1. Multiple logistic regression

Let's switch from linear regression to logistic regression.

2. Bank churn dataset

We'll revisit the three-column bank churn dataset from the previous course. has_churned is the response variable, denoting whether or not the customer churned. time_since_first_purchase measures the length of the customer relationship, and time_since_last_purchase measures how recently the customer was active. The explanatory variables have been transformed to protect commercially sensitive information.

3. glm()

Recall that performing a logistic regression requires two changes compared to a linear regression. Firstly, you call glm, for generalized linear models, rather than lm. Secondly, you include a family argument to specify the error distribution, set to binomial. You'll explore the binomial function later in the chapter. To extend logistic regression to multiple explanatory variables, you change the formula in the same way as for linear regression: a plus to ignore interactions, or a times to include them. There's no new syntax here.
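As a sketch, here's how that looks in code. The real churn dataset isn't reproduced here, so this uses simulated stand-in data: the column names match the description above, but the values and coefficients are made up for illustration.

```r
# Simulated stand-in for the bank churn dataset (values are illustrative only).
set.seed(42)
n <- 200
churn <- data.frame(
  time_since_first_purchase = rnorm(n),
  time_since_last_purchase  = rnorm(n)
)
# Assumed relationship: longer relationships reduce churn,
# longer gaps since the last purchase increase it.
churn$has_churned <- rbinom(
  n, size = 1,
  prob = plogis(-0.5 * churn$time_since_first_purchase +
                 0.8 * churn$time_since_last_purchase)
)

# glm() instead of lm(), with family = binomial for logistic regression.
# "+" includes both explanatory variables without an interaction;
# swap it for "*" to include the interaction term as well.
mdl_churn <- glm(
  has_churned ~ time_since_first_purchase + time_since_last_purchase,
  data = churn,
  family = binomial
)
coefficients(mdl_churn)
```

The fitted object has three coefficients: an intercept plus one slope per explanatory variable.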

4. Prediction flow

The prediction flow should also feel familiar, since you've seen all the techniques already. Use expand_grid to make a tibble of explanatory variables, then mutate to add a column of predictions. The only change from the linear regression case is that you need to specify type equals "response" in the call to predict.
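A minimal sketch of that flow, again on simulated stand-in data (expand_grid comes from tidyr, mutate from dplyr):

```r
library(tidyr)  # expand_grid()
library(dplyr)  # mutate()

# Simulated stand-in data and model, since the real dataset isn't public.
set.seed(42)
n <- 200
churn <- data.frame(
  time_since_first_purchase = rnorm(n),
  time_since_last_purchase  = rnorm(n)
)
churn$has_churned <- rbinom(n, 1, plogis(churn$time_since_last_purchase))
mdl_churn <- glm(
  has_churned ~ time_since_first_purchase + time_since_last_purchase,
  data = churn, family = binomial
)

# 1. Tibble of explanatory values: every combination of the two variables.
explanatory_data <- expand_grid(
  time_since_first_purchase = seq(-1, 1, 0.5),
  time_since_last_purchase  = seq(-1, 1, 0.5)
)

# 2. Add a column of predictions. type = "response" returns probabilities
#    on the response scale rather than log-odds.
prediction_data <- explanatory_data %>%
  mutate(has_churned = predict(mdl_churn, explanatory_data, type = "response"))

head(prediction_data)
```

With five values per variable, expand_grid produces all twenty-five combinations, and each prediction is a churn probability between zero and one.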

5. The four outcomes

Recall that when the response variable has two possible values, there are four outcomes for the model. Either it correctly predicts positive and negative responses, or it gets them wrong, giving a false positive or a false negative. We can quantify and visualize these four outcomes using a confusion matrix.

6. Confusion matrix

The code flow is the same as before. Get the actual responses from the dataset and the predicted responses from the model, rounded to give zeroes and ones. Then use table to get counts of each of the four outcomes. yardstick's conf_mat function converts this table into a confusion matrix object. That lets you plot the result as a mosaic plot using autoplot, and display metrics like model accuracy, sensitivity, and specificity using summary.
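Here's that flow as a hedged sketch on the same simulated stand-in data. The event_level argument is an assumption worth noting: since factor levels sort 0 before 1, "second" treats 1 (churned) as the positive class.

```r
library(yardstick)  # conf_mat()
library(ggplot2)    # autoplot() method for conf_mat objects

# Simulated stand-in data and model (the real dataset isn't public).
set.seed(42)
n <- 200
churn <- data.frame(
  time_since_first_purchase = rnorm(n),
  time_since_last_purchase  = rnorm(n)
)
churn$has_churned <- rbinom(n, 1, plogis(churn$time_since_last_purchase))
mdl_churn <- glm(
  has_churned ~ time_since_first_purchase + time_since_last_purchase,
  data = churn, family = binomial
)

# Actual vs. predicted responses; round() turns probabilities into 0/1.
actual_response    <- churn$has_churned
predicted_response <- round(fitted(mdl_churn))

# Counts of the four outcomes, converted to a confusion matrix object.
outcomes  <- table(predicted_response, actual_response)
confusion <- conf_mat(outcomes)

autoplot(confusion)  # mosaic plot
# event_level = "second" treats 1 (churned) as the positive class.
summary(confusion, event_level = "second")
```

The summary output includes accuracy, sensitivity (the true positive rate), and specificity (the true negative rate), among other metrics.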

7. Visualization

Visualizing the model when you have multiple explanatory variables is trickier. As with linear regression visualizations, you can use faceting to provide different panels for categorical variables. For the case of two numeric explanatory variables, you can map the response variable to color. A nice trick here is to give predicted probabilities below zero-point-five one color, and predicted probabilities above zero-point-five another color. This is achieved with ggplot2's scale_color_gradient2 function, setting midpoint to zero-point-five. You'll see how it looks in the exercises.
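A sketch of that plot, again on simulated stand-in data. The shape and size values for the prediction points are illustrative choices, not requirements.

```r
library(ggplot2)
library(tidyr)  # expand_grid()

# Simulated stand-in data, model, and prediction grid.
set.seed(42)
n <- 200
churn <- data.frame(
  time_since_first_purchase = rnorm(n),
  time_since_last_purchase  = rnorm(n)
)
churn$has_churned <- rbinom(n, 1, plogis(churn$time_since_last_purchase))
mdl_churn <- glm(
  has_churned ~ time_since_first_purchase + time_since_last_purchase,
  data = churn, family = binomial
)
prediction_data <- expand_grid(
  time_since_first_purchase = seq(-2, 2, 0.5),
  time_since_last_purchase  = seq(-2, 2, 0.5)
)
prediction_data$has_churned <-
  predict(mdl_churn, prediction_data, type = "response")

p <- ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase,
                       color = has_churned)) +
  # Actual responses as small points, predictions as solid squares.
  geom_point() +
  geom_point(data = prediction_data, size = 3, shape = 15) +
  # Diverging color scale split at 0.5: probabilities below the midpoint
  # shade toward one color, probabilities above it toward the other.
  scale_color_gradient2(midpoint = 0.5)
p
```

Because both the actual 0/1 responses and the predicted probabilities share the has_churned column name, one color scale covers both layers.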

8. Let's practice!

Let's get logistic!