1. Why you need logistic regression
The datasets you've seen so far all had a numeric response variable. Now we'll explore the case of a binary response variable.
2. Bank churn dataset
Consider this dataset on churn at a European financial services company in 2006. There are 400 rows, each representing a customer. If the customer closed all their accounts during the time period, they were considered to have churned, and the has_churned column is marked with a one. If they still had an open account at the end of the time period, has_churned is marked with a zero.
Using one and zero for the response instead of a logical variable makes the plotting code easier.
The two explanatory variables are the time since the customer first bought a service and the time since they last bought a service.
Respectively, they measure the length of the relationship with the customer and the recency of the customer's activity.
The time columns contain negative values because they have been standardized for confidentiality reasons.
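Here's a minimal sketch of how you might take a first look at the data. The data frame name, churn, and the time column names are assumed for illustration; they aren't stated on the slide.

    # Assumed names: data frame "churn" with columns has_churned,
    # time_since_first_purchase, and time_since_last_purchase
    head(churn)  # 400 rows in total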
3. Churn vs. recency: a linear model
Let's run a linear model of churn versus recency and see what happens.
We can use the coefficients function to pull out the intercept and slope. The intercept is about zero-point-five, and the slope is slightly positive at zero-point-zero-six.
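Here's a sketch of that model, under the same naming assumptions as before.

    # Linear model of churn versus recency
    mdl_churn_vs_recency_lm <- lm(has_churned ~ time_since_last_purchase, data = churn)

    # Pull out the intercept and slope
    coeffs <- coefficients(mdl_churn_vs_recency_lm)
    intercept <- coeffs[1]
    slope <- coeffs[2]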
4. Visualizing the linear model
Here's a plot of the data points with the linear trend. I used geom_abline rather than geom_smooth so the line isn't limited to the extent of the data.
All the churn values are zero or one, but the model predictions are fractional. You can think of the predictions as being probabilities that the customer will churn.
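The plot could be built roughly like this, reusing the intercept and slope extracted above.

    library(ggplot2)

    ggplot(churn, aes(time_since_last_purchase, has_churned)) +
      geom_point() +
      # geom_abline draws the whole line, not just the part over the data
      geom_abline(intercept = intercept, slope = slope)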
5. Zooming out
Zooming out by setting axis limits with xlim and ylim shows the problem with using a linear model. In the bottom-left of the plot, the model predicts negative probabilities. In the top-right, the model predicts probabilities greater than one.
Both situations are impossible.
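A sketch of the zoomed-out plot; the axis limits here are illustrative, not necessarily the ones on the slide.

    ggplot(churn, aes(time_since_last_purchase, has_churned)) +
      geom_point() +
      geom_abline(intercept = intercept, slope = slope) +
      # Widen the axes to expose the impossible predictions
      xlim(-10, 10) +
      ylim(-0.2, 1.2)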
6. What is logistic regression?
The solution is to use a logistic regression model, a type of generalized linear model used when the response variable is logical.
Whereas linear models result in predictions that follow a straight line, logistic models result in predictions that follow a logistic curve, which is S-shaped.
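To see the S-shape for yourself, here's a quick sketch of the logistic (sigmoid) function, which maps any real number to a value between zero and one.

    logistic <- function(x) 1 / (1 + exp(-x))

    # Plot the curve over an arbitrary range
    ggplot(data.frame(x = c(-6, 6)), aes(x)) +
      stat_function(fun = logistic)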
7. Linear regression using glm()
Before we run a logistic regression, it's worth noting that you can run a linear regression using the glm function, for generalized linear models. Replace lm with glm and set the family argument to gaussian.
family specifies the family of distributions used for the residuals. You can pass it with or without quotes.
Here, the coefficients are the same as before.
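As a sketch, the glm version of the same model looks like this.

    # Same linear model, fitted as a generalized linear model
    mdl_churn_vs_recency_glm <- glm(
      has_churned ~ time_since_last_purchase,
      data = churn,
      family = gaussian  # "gaussian" in quotes works too
    )

    coefficients(mdl_churn_vs_recency_glm)  # matches the lm() coefficients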
8. Logistic regression: glm() with binomial family
To run a logistic regression, you also call glm. This time, set the family argument to binomial to specify residuals from the binomial distribution.
As before, you get two coefficients, an intercept and one for the numerical explanatory variable. The interpretation is a little different; we'll come to that later.
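A sketch of the logistic regression call, with the same assumed names.

    # Logistic regression: binomial family for a binary response
    mdl_churn_vs_recency <- glm(
      has_churned ~ time_since_last_purchase,
      data = churn,
      family = binomial
    )

    coefficients(mdl_churn_vs_recency)  # reported on the log-odds scale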
9. Visualizing the logistic model
Let's add the glm predictions to the plot. ggplot will draw a logistic regression trend line with geom_smooth, shown in blue here. Notice that the prediction line is slightly curved.
Look closely at the differences from our previous use of geom_smooth. In the method argument, use "glm", not "lm". You also need to add a method-dot-args argument containing a list of the other arguments passed to glm. In this case, you need to set family equals binomial inside the list.
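Putting that together, the plotting code might look like this; se equals FALSE just hides the standard error ribbon.

    ggplot(churn, aes(time_since_last_purchase, has_churned)) +
      geom_point() +
      geom_smooth(
        method = "glm",                         # "glm", not "lm"
        se = FALSE,
        method.args = list(family = binomial)   # extra arguments passed on to glm()
      )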
10. Zooming out
Now zooming out shows that the logistic regression curve never goes below zero or above one.
To interpret this curve: when the standardized time since last purchase is very low, the predicted probability of churning is close to zero. When the time since last purchase is very high, the probability is close to one. That is, customers who bought something recently are less likely to churn.
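You can also read probabilities straight off the model with predict. The explanatory values below are made up for illustration.

    # type = "response" returns probabilities rather than log-odds
    explanatory_data <- data.frame(time_since_last_purchase = c(-2, 0, 2))
    predict(mdl_churn_vs_recency, explanatory_data, type = "response")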
11. Let's practice!
Let's get logistic on this dataset.