Logistic regression: predicting the probability of default

1. Logistic regression: predicting the probability of default

You've now learned how to build a logistic regression model. The next step is to use the fitted model and its parameter estimates to estimate the probability of default for the cases in the test set. These test set predictions can then be compared to the actual outcomes (default or non-default), which is necessary to validate the model.

2. An example with "age" and "home ownership"

To walk through this principle, let's look at a simple example with the variables age and home_ownership. The parameter estimates were obtained using the training set from our loan data. Using the formula discussed in the previous video together with these parameter estimates, we can compute a probability of default from the covariate values of the cases in the test set. Again, note that there is a parameter estimate for each category of the categorical variable except the reference category, which is the MORTGAGE home ownership category in this model.
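As a reminder, the formula for this model can be written out as follows. This is a sketch rather than the slide itself: the indexing of beta 1 (age) and beta 4 (RENT) is an assumption, inferred from the statement that beta 2 and beta 3 belong to the "other" and "owner" categories, and OTHER, OWN and RENT stand for 0/1 indicator variables.

```latex
P(\text{default} = 1 \mid x) = \frac{1}{1 + e^{-\eta}},
\qquad
\eta = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{OTHER} + \beta_3\,\text{OWN} + \beta_4\,\text{RENT}
```

A customer in the MORTGAGE reference category has all three indicators equal to zero, so only the intercept and the age term contribute.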

3. Test set example

Let's look at a specific example. The customer on the first line of the test set is 33 years old and a renter. Using the model at hand, you can compute the probability of default like this. Beta 2 and beta 3, which belong to the "other" and "owner" categories of home ownership, respectively, are each multiplied by zero, because this customer is in neither category. Plugging in the other parameter estimates gives a probability of default of 11.56% for this specific customer.
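Written out symbolically, the computation for this customer looks roughly like the sketch below. The numeric parameter estimates shown on the slide are not reproduced in this transcript, so they stay as symbols here; treating beta 1 as the age coefficient and beta 4 as the RENT coefficient is an assumption about the indexing.

```latex
\begin{aligned}
\eta &= \hat\beta_0 + \hat\beta_1 \cdot 33 + \hat\beta_2 \cdot 0 + \hat\beta_3 \cdot 0 + \hat\beta_4 \cdot 1
      = \hat\beta_0 + 33\,\hat\beta_1 + \hat\beta_4 \\
P(\text{default} = 1) &= \frac{1}{1 + e^{-\eta}} \approx 0.1156
\end{aligned}
```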

4. Making predictions in R

So how does this work in R? You can use the predict() function to predict the probability of default for one or several test set cases. Let's start by selecting the first case of the test set. Whether you are using one or several cases, always make sure the test cases are stored in a data frame. Let's explore this test case: the age is 33, and the home_ownership category is RENT. In predict(), the first argument is the fitted model and the second argument, newdata, is the new data, here test_case. Doing this, you get a result of -2.03499. This is not yet the fitted probability of default, but the linear predictor, that is, the log-odds of default. Setting the type argument to "response" gives the actual predicted probability of default for this customer, which is the 11.56% computed earlier.
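The two predict() calls can be sketched as follows. The course's loan data is not available in this transcript, so the block simulates a small stand-in training set; the simulated values, the outcome column loan_status and the model name log_model are all hypothetical, and the numbers it produces will not match the ones on the slide. Only the last line uses a figure from the video, to show how the linear predictor of -2.03499 maps to a probability.

```r
# Synthetic stand-in for the loan data (hypothetical values, NOT the course data)
set.seed(42)
n <- 500
training_set <- data.frame(
  age = sample(21:70, n, replace = TRUE),
  home_ownership = factor(
    sample(c("MORTGAGE", "OTHER", "OWN", "RENT"), n, replace = TRUE),
    levels = c("MORTGAGE", "OTHER", "OWN", "RENT")  # first level = reference category
  )
)
training_set$loan_status <- rbinom(n, size = 1, prob = 0.11)  # 1 = default

# Fit the logistic regression on the training set
log_model <- glm(loan_status ~ age + home_ownership,
                 family = "binomial", data = training_set)

# A single test case, stored as a data frame with the same factor levels
test_case <- data.frame(
  age = 33,
  home_ownership = factor("RENT", levels = levels(training_set$home_ownership))
)

# Default (type = "link"): the linear predictor, i.e. the log-odds of default
linear_predictor <- predict(log_model, newdata = test_case)

# type = "response": the predicted probability of default
prob_default <- predict(log_model, newdata = test_case, type = "response")

# The two are connected by the logistic function
manual_prob <- 1 / (1 + exp(-linear_predictor))

# Applying the same transformation to the value from the video
slide_prob <- plogis(-2.03499)  # about 0.1156
```

Keeping the test case in a data frame whose factor levels match the training data is what lets predict() build the right indicator variables for home_ownership.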

5. Let's practice!

Now, let's make some predictions!