1. Logistic regression to predict probabilities
In this chapter, you will learn about regression in non-linear situations. This first lesson covers regression to predict probabilities.
2. Predicting Probabilities
Predicting whether an event will occur is a classification problem, but we'll treat predicting the numerical probability that it occurs as regression.
However, unlike standard regression, probabilities can only be in the range 0-1.
Let's see an example.
3. Example: Predicting Duchenne Muscular Dystrophy (DMD)
In this example, we want to develop a test to detect the gene for DMD in women. The test uses the measurement of two enzymes in the blood, here labeled CK and H. What is the probability that a woman is a DMD carrier, based on her CK and H levels?
4. A Linear Regression Model
We can try a linear regression, where the outcome is 1 or TRUE for women who have the DMD gene, and 0 or FALSE otherwise. Unfortunately, the model predicts probabilities outside the range 0-1. We'd like a model that only predicts valid probabilities.
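To see why a straight line runs into trouble here, the following is a minimal sketch on made-up data (not the DMD measurements). The names toy and lin_model are invented for illustration. It fits lm to a 0/1 outcome and checks the range of the predictions.

```r
# Made-up illustration data: one predictor and a 0/1 outcome.
toy <- data.frame(x = seq(-5, 5, by = 0.5))
toy$y <- as.numeric(toy$x > 0)

# Ordinary linear regression on the 0/1 outcome.
lin_model <- lm(y ~ x, data = toy)
lin_pred <- predict(lin_model, newdata = toy)

# The fitted line keeps going past 0 and 1 at the extremes of x
# (roughly -0.24 to 1.19 on this toy data).
range(lin_pred)
```

Any prediction below 0 or above 1 cannot be read as a probability, which is the problem logistic regression is meant to avoid.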
5. Logistic Regression
The counterpart to linear regression for predicting probabilities is logistic regression. Logistic regression assumes that the log-odds of the outcome are an additive, linear function of the inputs, where the odds are the ratio of the probability that an event occurs to the probability that it does not.
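As a quick sketch of what odds and log-odds look like numerically, here is a small base R example. The probability 0.8 is an arbitrary value chosen for illustration; qlogis and plogis are the base R functions for the logit and its inverse.

```r
p <- 0.8                 # probability that the event occurs (arbitrary example)
odds <- p / (1 - p)      # ratio of "occurs" to "does not occur", about 4
log_odds <- log(odds)    # the scale on which the model is linear

# qlogis() maps a probability to log-odds; plogis() maps log-odds back.
all.equal(qlogis(p), log_odds)
all.equal(plogis(log_odds), p)

# Any log-odds value, however large or small, maps back to a value in (0, 1).
plogis(c(-10, 0, 10))
```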
You fit logistic regression models in R with the glm function. glm looks a lot like lm; it takes as input a formula, a data frame, and a third argument called family.
The family argument describes the error distribution of the model; just remember that for logistic regression, use family = binomial.
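A call might look like the sketch below. The data frame toy and its columns x and y are made up for illustration; only the shape of the glm call (formula, data, family = binomial) comes from the lesson.

```r
# Made-up illustration data: one predictor and a TRUE/FALSE outcome.
toy <- data.frame(
  x = 1:10,
  y = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Same formula-and-data interface as lm(), plus the family argument.
model <- glm(y ~ x, data = toy, family = binomial)

# Printing the model shows the coefficients and the deviances.
print(model)
```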
6. DMD model
glm also assumes that there are two possible outcomes, a and b. The model returns the probability of event b. To make the model easier to understand, we recommend that you encode the two outcomes as 0/1 or FALSE and TRUE.
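The sketch below illustrates the point on made-up data with hypothetical column names (status, is_b). When the outcome is a factor, R's binomial family treats the first level as the non-event and models the probability of the other level, so encoding the outcome explicitly as TRUE/FALSE gives the same fit while making the target event obvious.

```r
# Made-up data: outcome recorded as a two-level factor with levels "a" and "b".
toy <- data.frame(
  x = 1:10,
  status = factor(c("a", "a", "a", "b", "a", "b", "a", "b", "b", "b"),
                  levels = c("a", "b"))
)

# Explicit TRUE/FALSE encoding of "the outcome is b".
toy$is_b <- toy$status == "b"

m_factor  <- glm(status ~ x, data = toy, family = binomial)
m_logical <- glm(is_b ~ x, data = toy, family = binomial)

# Both models estimate the probability of event b, so the coefficients agree.
all.equal(coef(m_factor), coef(m_logical))
```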
7. Interpreting Logistic Regression Models
Read the signs of the coefficients of a logistic regression as you would those of a linear model. If a coefficient is positive, the event becomes more probable as that variable increases, with everything else held constant.
In our example, increased levels of both CK and H make the probability of carrying the DMD gene higher.
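Here is a sketch of reading coefficient signs, using simulated data rather than the real DMD measurements. The data frame sim, the standardized CK and H values, and the true effects (1.5 and 1.0) are all assumptions made up for this illustration.

```r
# Simulated stand-in for the DMD data (NOT the real measurements).
set.seed(42)
n <- 500
sim <- data.frame(CK = rnorm(n), H = rnorm(n))

# By construction, higher CK and higher H both raise the log-odds.
sim$DMD <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$CK + 1.0 * sim$H)) == 1

model <- glm(DMD ~ CK + H, data = sim, family = binomial)

# Coefficients are on the log-odds scale; a positive sign means the event
# becomes more probable as that input increases, all else held constant.
coefs <- coef(model)
coefs
sign(coefs[c("CK", "H")])
```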
8. Predicting with a glm() model
predict takes as inputs the model and a data frame. To get the probabilities, include the argument type = "response".
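The following sketch shows the difference the type argument makes, on made-up data (toy and new_rows are hypothetical names). Without type = "response", predict on a glm model returns values on the log-odds scale rather than probabilities.

```r
# Made-up illustration data.
toy <- data.frame(
  x = 1:10,
  y = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
model <- glm(y ~ x, data = toy, family = binomial)

new_rows <- data.frame(x = c(0, 5.5, 12))

# Default scale is the linear predictor (log-odds), which is unbounded.
log_odds <- predict(model, newdata = new_rows)

# type = "response" converts to probabilities between 0 and 1.
prob <- predict(model, newdata = new_rows, type = "response")

# The two are related by the inverse logit.
all.equal(prob, plogis(log_odds))
```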
9. DMD Model
We can fit a logistic model to the dmd training data, and look at the predictions. Now all the predictions lie between zero and one. But how good are they?
10. Evaluating a logistic regression model: pseudo-$R^2$
Squared error and RMSE are not good measures for logistic regression models. Instead, use deviance and pseudo-R-squared. You can think of deviance as being similar to variance.
Pseudo-R-squared is analogous to R-squared. It compares the deviance of a model to the null deviance of the data: pseudo-R-squared is 1 minus the ratio of the deviance to the null deviance. A good fit gives pseudo-R-squared near 1.
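As a minimal sketch, the calculation can be done directly from the deviance and null.deviance components that a fitted glm object stores. The data here are made up for illustration.

```r
# Made-up illustration data.
toy <- data.frame(
  x = 1:10,
  y = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
model <- glm(y ~ x, data = toy, family = binomial)

# Deviance of the fitted model and of the intercept-only (null) model.
model$deviance
model$null.deviance

# Pseudo-R-squared: the fraction of the null deviance the model explains.
pseudo_r2 <- 1 - model$deviance / model$null.deviance
pseudo_r2
```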
11. Pseudo-$R^2$ on Training data
For training performance, you can calculate pseudo-R-squared using the deviance and null deviance from glance (in the broom package), or call wrapChiSqTest (in the sigr package) on the model.
12. Pseudo-$R^2$ on Test data
For test performance, call wrapChiSqTest on a data frame that has both the predictions and the true outcomes. You also have to designate the target event, which in our example is "TRUE".
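To make the idea concrete, here is a hand-rolled sketch of a test-set pseudo-R-squared in base R. It is an illustration under stated assumptions, not necessarily what wrapChiSqTest computes internally: the data are simulated, the helper deviance_of is hypothetical, and the null model is taken to be the test set's own event rate.

```r
# Simulated stand-in for the DMD data (NOT the real measurements).
set.seed(42)
n <- 500
sim <- data.frame(CK = rnorm(n), H = rnorm(n))
sim$DMD <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$CK + 1.0 * sim$H)) == 1

train <- sim[1:350, ]
test  <- sim[351:500, ]

model <- glm(DMD ~ CK + H, data = train, family = binomial)
test$pred <- predict(model, newdata = test, type = "response")

# Hypothetical helper: binomial deviance of probabilities vs TRUE/FALSE outcomes.
deviance_of <- function(pred, truth) {
  -2 * sum(ifelse(truth, log(pred), log(1 - pred)))
}

dev_model <- deviance_of(test$pred, test$DMD)

# Assumed null model: predict the test set's overall event rate for every row.
dev_null <- deviance_of(rep(mean(test$DMD), nrow(test)), test$DMD)

pseudo_r2_test <- 1 - dev_model / dev_null
pseudo_r2_test
```

Unlike the training value, a test-set pseudo-R-squared can come out negative when the model predicts held-out data worse than the constant event rate does.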
13. The Gain Curve Plot
The gain curve plot is great for evaluating logistic regression models.
The wizard curve (in green) shows how a perfect model would sort the events: in our case, all the women with the DMD gene first. The blue curve shows how the model's probability scores would sort the events. We want women who carry the DMD gene to have higher probability scores than women who do not. The closer the blue curve is to the green one, the better the model's probability estimates are for identifying positive instances.
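The numbers behind the two curves can be sketched by hand in base R. The scores and outcomes below are made up, and cumulative_gain is a hypothetical helper: it sorts the cases from highest to lowest sort key and tracks the fraction of all positives found so far.

```r
# Made-up probability scores and true outcomes for eight cases.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
truth  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Hypothetical helper: fraction of positives captured after each sorted case.
cumulative_gain <- function(sort_key, truth) {
  ord <- order(sort_key, decreasing = TRUE)
  cumsum(truth[ord]) / sum(truth)
}

# Model curve: sort by the model's scores.
model_curve <- cumulative_gain(scores, truth)

# Wizard curve: sort by the true outcome, so all positives come first.
wizard_curve <- cumulative_gain(as.numeric(truth), truth)

model_curve   # 0.25 0.50 0.50 0.75 0.75 0.75 1.00 1.00
wizard_curve  # 0.25 0.50 0.75 1.00 1.00 1.00 1.00 1.00
```

The model curve can never rise above the wizard curve; the gap between them shows where the scores rank a non-carrier ahead of a carrier.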
14. Let's practice!
Now let's practice fitting logistic regression models.