1. Logistic regression: introduction
Congratulations! We have now preprocessed the data. We have taken out outliers, coarse classified both employment length and interest rate, and split the data into a training and a test set.
2. Final data structure
Let's have a look at the structure of the training set. Remember that loan_status is our response variable, and we now have four factor variables and three continuous variables as explanatory variables.
3. What is logistic regression?
In this chapter, we will discuss logistic regression. As the name suggests, logistic regression is similar in many ways to linear regression, except that the output of the model is a value between zero and one. This is necessary as we are interested in predicting the probability of default, which, by definition, is between zero and one.
In a logistic regression model, the probability of default can also be written as the probability that the loan status is equal to one, conditional on the variables x_1 to x_m, which in the case of our data set are loan amount, grade, age, et cetera. The parameters beta_0 to beta_m are estimated from the data. The combination of the parameters and the variables, beta_0 plus beta_1 times x_1 up to beta_m times x_m, is called the linear predictor.
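Written out, the model described above takes the following form. This is the standard logistic regression expression, given here as a sketch because the transcript does not reproduce the slide itself; the symbols follow the x_1 to x_m and beta_0 to beta_m notation used in the text.

```latex
P(\text{loan\_status} = 1 \mid x_1, \ldots, x_m)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m)}}
```

The exponent, without its minus sign, is the linear predictor. Whatever value it takes, the right-hand side stays strictly between zero and one.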
4. Fitting a logistic model in R
You can fit a logistic regression model and obtain the parameter estimates in R using the glm() function, which stands for generalized linear model, with the family argument equal to "binomial". Let's look at an example in which we only include the variable age as a predictor. When looking at the result, we get some coefficients and model diagnostics. For now, we'll focus on the coefficients. The intercept is the estimate for beta_0, and the value under age is the estimate for beta_1. So how do we interpret these numbers?
To answer this question, we first take a step back and do some basic math.
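The glm() call described in this section can be sketched as follows. The course's training set is not available here, so the snippet simulates a small stand-in data frame. The names training_set and log_model, and the simulated coefficients, are assumptions made for illustration only and are not the course's actual data or results.

```r
# Hypothetical stand-in for the training set: simulated data, NOT the real loan data.
set.seed(42)
n <- 1000
training_set <- data.frame(age = sample(20:70, n, replace = TRUE))

# Simulate defaults from a made-up logistic model (intercept -1.9, age slope -0.01).
p_default <- plogis(-1.9 - 0.01 * training_set$age)
training_set$loan_status <- rbinom(n, size = 1, prob = p_default)

# Fit a logistic regression with age as the only predictor.
log_model <- glm(loan_status ~ age, family = "binomial", data = training_set)

# Printing the model shows the coefficients and some model diagnostics.
print(log_model)

# Extract just the estimates: "(Intercept)" is beta_0, "age" is beta_1.
coef(log_model)
```

Because the data are simulated, the estimates printed here will differ from the ones discussed in the transcript.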
5. Probabilities of default
You have seen the expression for the probability of default. By multiplying both the numerator and the denominator of this expression by the exponential function of the linear predictor, we get this result. The probability that the loan_status is equal to zero, or the probability of non-default, is given by one minus the probability of default. Rewriting the expression gives this result. Now, if we divide the probability of default by the probability of non-default, we get the odds in favor of default. Conveniently, this is simply equal to the exponential function of our linear predictor. Now, how do we interpret this?
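The steps just described can be written out as below. This is a sketch that assumes the standard logistic form for the probability of default; eta is introduced here only as shorthand for the linear predictor and is not notation from the transcript.

```latex
\begin{align*}
\eta &= \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m \\
P(\text{loan\_status} = 1 \mid x_1, \ldots, x_m)
  &= \frac{1}{1 + e^{-\eta}}
   = \frac{e^{\eta}}{1 + e^{\eta}} \\
P(\text{loan\_status} = 0 \mid x_1, \ldots, x_m)
  &= 1 - \frac{e^{\eta}}{1 + e^{\eta}}
   = \frac{1}{1 + e^{\eta}} \\
\frac{P(\text{loan\_status} = 1 \mid x_1, \ldots, x_m)}
     {P(\text{loan\_status} = 0 \mid x_1, \ldots, x_m)}
  &= e^{\eta}
\end{align*}
```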
6. Interpretation of coefficient
Let's assume that the value for variable x_j goes up by one, while all other variables remain equal. If this happens, the odds for default will be multiplied by the exponential of beta_j. Note that odds will DECREASE for increasing x_j when beta_j is negative, and odds will INCREASE for increasing x_j when beta_j is positive.
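The multiplication rule follows directly from the odds being the exponential of the linear predictor. As a short worked step, using the same beta and x notation as the text:

```latex
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)}
  = \frac{e^{\beta_0 + \cdots + \beta_j (x_j + 1) + \cdots + \beta_m x_m}}
         {e^{\beta_0 + \cdots + \beta_j x_j + \cdots + \beta_m x_m}}
  = e^{\beta_j}
```

Since e^{beta_j} is below one for negative beta_j and above one for positive beta_j, the sign of the coefficient determines the direction of the effect on the odds.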
Going back to the logistic regression model we built, we see that the coefficient for age is -0.009726. If age goes up by one, the odds of default will be multiplied by exp(-0.009726), which is around 0.990. This means that one extra year of age lowers the odds of default by about one percent. Note that this statement is about the odds, not about the probability of default itself.
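To check the arithmetic for the age coefficient, the multiplier can be computed directly in R. The coefficient value comes from the text. The baseline default probability of 0.10 is a hypothetical number, used only to show how a change in odds translates into a change in probability.

```r
# Coefficient for age, as reported in the text.
beta_age <- -0.009726

# Factor by which the odds of default are multiplied per extra year of age.
odds_multiplier <- exp(beta_age)
odds_multiplier                      # roughly 0.990

# Percentage change in the odds per extra year of age (a bit under one percent down).
percent_change_odds <- (odds_multiplier - 1) * 100
percent_change_odds

# Hypothetical baseline: a 10% probability of default (assumed for illustration).
p_base <- 0.10
odds_base <- p_base / (1 - p_base)

# Apply the multiplier to the odds, then convert back to a probability.
odds_new <- odds_base * odds_multiplier
p_new <- odds_new / (1 + odds_new)
p_new                                # slightly below 0.10
```

The probability moves by less than the odds do in relative terms, which is why the coefficient is best read as an effect on the odds.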
7. Let's practice!
Now let's practice!