
Binary data and logistic regression

1. Binary data and logistic regression

In the previous chapter, you learned about the components of GLMs. In this chapter, we focus on modeling binary data using logistic regression. In this video, we will cover binary data and how logistic regression is defined and computed.

2. Binary response data

Binary data is one of the most common response data types. As the name suggests, it is a two-class category, which we usually denote as 0 or 1. The 0s and 1s can have many different meanings depending on the underlying research problem. For example, in credit scoring, a loan can either default or not. Similarly, a student either passes or fails a test. In my work, I often use logistic regression to model the probability of default for retail and corporate loans.

3. Binary data

Binary data can occur in two forms, ungrouped and grouped. Ungrouped data is represented by a single event, like the flip of a coin, with two outcomes following a Bernoulli distribution with probability p. This is a special case of the Binomial with n equal to 1. Grouped data, on the other hand, represents multiple events occurring at the same time, measuring the number of successes in n trials. Grouped data follows a binomial distribution. Ungrouped data is the form most commonly used in studies.
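The two forms can be sketched with a small, hypothetical example (the outcomes below are made up for illustration):

```python
import numpy as np

# Ungrouped form: each entry is one Bernoulli trial (0 = fail, 1 = pass).
# Hypothetical outcomes for 8 individual students:
ungrouped = np.array([0, 1, 1, 0, 1, 1, 1, 0])

# Grouped form: the same information summarized as the number of
# successes out of n trials, which follows a Binomial(n, p) distribution.
successes = int(ungrouped.sum())   # 5 successes
n_trials = ungrouped.size          # out of 8 trials

print(successes, n_trials)  # 5 8
```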

4. Logistic function

Consider the following binary data used to predict the probability of passing a test given hours of studying.

5. Logistic function

where Pass is marked as 1 and Fail as 0. Mathematically we would like to predict the probability that the outcome y is in class 1. We know from previous videos that a linear fit is not adequate.

6. Logistic function

The S-shaped logistic function, or sigmoid curve, comes to the rescue, fitting the data and providing the probability of the outcome given x. In the next video, we will learn how to interpret the function.
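As a quick sketch, the logistic function maps any real-valued input to a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# The curve is bounded: large negative inputs approach 0, large positive
# inputs approach 1, and z = 0 maps to exactly 0.5.
print(sigmoid(0))    # 0.5
print(sigmoid(-10))  # close to 0
print(sigmoid(10))   # close to 1
```

This boundedness is exactly why the sigmoid succeeds where the linear fit fails for binary outcomes.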

7. Odds and odds ratio

Now let's introduce another concept that is crucial for logistic regression: the odds. Odds, by definition, are the ratio of the chance of an event occurring to the chance of it not occurring, and the odds ratio is simply the ratio of two odds.

8. Odds example

For example, given 4 games, the odds of winning a game are 3 to 1, meaning that the event win occurred 3 times and loss once.

9. Odds and probabilities

Odds are not probabilities, but the two are directly related and can be computed from each other: the odds are the ratio of the probability of the event to the probability of the non-event. Rewriting the model in terms of odds removes the upper bound on the response, and taking the logarithm of the odds removes the lower bound as well.
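A short sketch of the conversion in both directions, using the earlier games example where winning 3 of 4 games gives a probability of 0.75:

```python
def odds_from_prob(p):
    """Odds = P(event) / P(non-event)."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Invert: probability = odds / (1 + odds)."""
    return odds / (1 + odds)

# A probability of 0.75 corresponds to odds of 3, i.e. 3 to 1.
print(odds_from_prob(0.75))  # 3.0
print(prob_from_odds(3.0))   # 0.75
```

Note that while probabilities live in [0, 1], the odds range from 0 to infinity, which is the "removing the upper bound" step mentioned above.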

10. From probability model to logistic regression

Let's review all the steps that make up logistic regression. First, recall our initial model from chapter 1, which we could not fit with an unbounded linear function. Applying the logistic function to the model provides the bounds required for our binary data. We compute mu, the estimated probability of the event, as well as the probability of the event not occurring, i.e. 1 minus mu.

11. From probability model to logistic regression

Finally, expressing the odds in terms of the probability and applying the log transformation is central to logistic regression. It provides many desirable properties, as in linear regression, which we will see later on. Note that the logit is linear in its parameters and can range from minus infinity to infinity, depending on the range of x.
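The chain of steps can be verified numerically: starting from a linear predictor, applying the logistic function gives a bounded probability mu, and taking the log-odds of mu recovers the linear predictor exactly. The coefficients below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical coefficients and predictor values for illustration:
beta0, beta1 = -4.0, 1.5
x = np.array([0.0, 2.0, 4.0])

eta = beta0 + beta1 * x        # linear predictor, unbounded
mu = 1 / (1 + np.exp(-eta))    # logistic function bounds mu in (0, 1)
logit = np.log(mu / (1 - mu))  # log-odds recovers the linear predictor

print(np.allclose(logit, eta))  # True
```

This is why the logit is called the link function: it connects the bounded mean mu back to the unbounded linear predictor.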

12. Logistic regression in Python

Recall the Python glm function from the previous video, which encapsulates all the steps we just covered in the previous two slides. For binary data, the Binomial distribution with its default logit link function is used. Inputs can be binary, 0 or 1, or a two-level factor such as Yes/No.

13. Let's practice!

Time to build a model!
