1. Types of model outcomes
Up until now, we have been using GAMs to model only one type of outcome - continuous numeric values. However, GAMs can model many other types of outcomes. In this chapter, we'll learn how to use logistic GAMs for binary classification.
2. Types of outcomes
In our previous data, all of the outcomes, or Y values, were continuous numbers that could take on many different values, such as speed, fuel efficiency, or concentration of pollution.
However, we often want to model data with binary outcomes, like the presence of organisms, customer conversion, or yes/no answers on a survey. We need to modify our models to take into account this type of data.
3. Probabilities and log-odds: logistic function
When we model a binary outcome, our prediction is a probability, which must be between zero and one. Since a GAM's output can be any real number, we convert it to a probability using the logistic function. The logistic function is a transformation that converts numbers of any value to probabilities between zero and one. In this context, the unbounded numbers can be interpreted as log-odds: the log of the ratio of positive outcomes to negative outcomes.
4. Probabilities and log-odds: logit function
The inverse of the logistic function is the logit function, which translates probabilities between zero and one into log-odds, which can take on any value.
5. Logistic and logit functions in R
In R, the logistic function is plogis(), and the logit function is qlogis(). These functions are inverses of each other: the logistic of a logit returns the original value. You can also see how probabilities convert to log-odds. A probability of 0.25 converts to log-odds by taking the log of the ratio of positive outcomes - one - to negative outcomes - three.
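The conversions above can be checked directly at the R console:

```r
# Logistic function: log-odds -> probability
plogis(0)             # 0.5: log-odds of zero means even odds

# Logit function: probability -> log-odds
qlogis(0.25)          # log(1/3), about -1.099: one positive to three negatives

# The two functions are inverses of each other
plogis(qlogis(0.25))  # returns the original value, 0.25
```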
6. Logistic GAMs with mgcv
To fit a GAM with binary outcomes, we add the family = binomial argument to our gam() function call. This tells gam() that the outcomes are ones and zeros, and that it should fit the model on the logistic (log-odds) scale.
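A minimal sketch of such a call, using simulated data (the variable names x1, x2, and the 0/1 outcome y are illustrative, not from the course data):

```r
library(mgcv)

# Simulate illustrative data with a binary 0/1 outcome
set.seed(1)
dat <- data.frame(x1 = runif(400), x2 = runif(400))
dat$y <- rbinom(400, 1, plogis(2 * sin(pi * dat$x1) + dat$x2 - 1))

# family = binomial tells gam() to fit on the logistic (log-odds) scale
log_mod <- gam(y ~ s(x1) + s(x2), data = dat, family = binomial)

summary(log_mod)
```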
7. Logistic GAM outputs
The output of a logistic GAM looks similar to that of previous GAMs we fit. As with regular GAMs, parametric terms are on top and smooths are on the bottom. EDFs still indicate the complexity of smooths, and asterisks indicate significance. However, it's important to understand that outputs are on the log-odds scale. To interpret them as probabilities, we need to convert them using the logistic function. Here, the value of the intercept is 0.733. We can use the plogis() logistic function to convert it to a probability.
Converted, the intercept is about 0.67.
This means that the model predicts a 67 percent baseline chance of a positive outcome. This is what we would expect if x1 and x2 were at their average values.
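That conversion is a one-liner (0.733 is the intercept value from the slide's model summary):

```r
# The intercept from the model summary is on the log-odds scale
plogis(0.733)            # ~0.675: baseline probability of a positive outcome

# Equivalently, applying the logistic formula by hand
1 / (1 + exp(-0.733))
```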
8. The csale data set
Before we start the exercises, let's get familiar with a new data set. The "csale" data set consists of anonymized information from the insurance industry. The outcome, "purchase", represents whether customers purchased insurance following a direct mail campaign. The other variables consist of information from those customers' credit reports. This is a small subset of the variables and rows in the full data set, which is available in the "Information" package. We'll be using this data to model predictors of purchasing behavior.
9. Let's practice!
Now let's fit and interpret some GAMs.