1. Overview of logistic regression
In this chapter, you'll learn about another type of GLM: logistic regression.
As a data scientist, I often use this powerful tool to model binary data; it is a specific type of binomial GLM.
2. Example toxicology study
As a scientist, I often use logistic regression for toxicology studies and animal behavior studies.
3. Sports analytics
My friend who works in sports analytics uses it to predict which teams will win games.
4. Online sellers
Another friend uses logistic regression to model selling strategies for third-party sellers on online shopping sites.
5. Chapter overview
In this chapter, we will start with an overview of logistic regression, then cover two types of probability inputs for logistic regression, and finally compare two different GLM link functions.
6. Why use logistic regression?
Many data scientists, including myself, often use logistic regression to model data with two possible outcomes.
This can include binary data coded as zeros and ones.
Or survival data, such as dead or alive.
I have also used logistic regression when modeling two choices or behaviors.
We can also use logistic regression to model outcomes such as passing a test, winning a coin toss, or winning a sports game.
7. What is logistic regression?
Logistic regression is a special type of binomial GLM and the default binomial GLM in R.
The model consists of two parts.
First, the model estimates the probability p of a binary outcome Y.
Second, this probability is linked to a linear equation using the logit function.
The logit function takes probabilities, which are bounded by zero and one, and links, or transforms, them to real numbers ranging from negative to positive infinity.
This transformation improves numerical stability and creates a linear equation.
The resulting linear equation is a regression that we know and love.
The regression includes an intercept beta0, a slope beta1 multiplied by x, and an error term epsilon.
8. Logit function
The logit function converts probabilities to "log-odds": the odds p/(1 - p) on the log scale.
The inverse logit function converts back from log-odds to the probability scale.
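As a quick illustration, base R already provides these two transformations: qlogis() is the logit and plogis() is the inverse logit.

```r
# Logit: probability -> log-odds
qlogis(0.5)    # 0 (a 50% chance has log-odds of zero)
qlogis(0.9)    # log(0.9 / 0.1), about 2.2

# Inverse logit: log-odds -> probability
plogis(0)      # 0.5
plogis(qlogis(0.9))  # back to 0.9
```

Notice that the two functions undo each other, which is exactly what lets us move between the probability scale and the linear-equation scale.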
You will see link functions again in this chapter when comparing the logit link to the probit link, and logits again in Chapter 3 when we cover odds ratios.
Now, enough with theory, let's see how to code logistic regression in R.
9. How to run logistic regression
We can fit a logistic regression in R using the glm() function.
Notice we use the same formula and data arguments as before.
Now, however, we must specify the binomial family.
The default link function for the binomial family is the logit, which produces a logistic regression; we cover other link functions in the next section of this chapter.
The response variable for a logistic regression can be either a binary 0/1 vector or a two-level factor such as yes/no or win/lose.
You will see other input options later in this chapter.
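As a sketch of the syntax (the data frame and variable names here are placeholders, not from a real dataset):

```r
# Hypothetical example: 'outcome' is a 0/1 vector or two-level factor,
# 'predictor' is an explanatory variable in the data frame 'my_data'
fit <- glm(outcome ~ predictor,
           data   = my_data,
           family = binomial)  # binomial family; logit link is the default

summary(fit)  # coefficients are on the logit (log-odds) scale
```

The only change from the linear models you fit earlier is the family argument.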
10. Riding the bus?
For example, you might use a logistic regression to model what makes people commuting to work more likely to ride the bus.
The response variable would be "yes," they ride the bus, or "no," they do not ride the bus.
You might wonder if the number of days per week someone commutes affects the chance that they ride the bus.
Do people who commute more ride the bus more? Maybe to save money?
Or, do people who commute more ride the bus less? Maybe to avoid the delays of mass transit?
You will get to examine this relationship using 2015 commuter data from Pittsburgh, PA, USA.
This dataset contains two columns of interest: the number of days per week a person commutes and if they rode the bus.
11. Let's practice!
Now, let's dive into the bus data with glm()s!