1. Comparing models
Now that you are armed with the knowledge of how to fit a multivariable model and assess the significance and size of its coefficients' effects, it remains to be seen whether including additional variables improves the model fit.

2. Deviance
To answer this, we consider a goodness-of-fit measure called the deviance statistic, which tests the null hypothesis that the fitted model is correct. With goodness of fit we are measuring whether our model is correctly specified and whether adding more complexity would improve it. Complexity here means adding more variables, non-linear terms, or interaction terms. We will discuss the effects of interactions in more detail in the coming video. Deviance is measured in terms of log-likelihood: formally, it is defined as negative two times the log-likelihood of the fitted model. It is a measure of error, so lower deviance means better model fit. As a benchmark, we use the null deviance, i.e. the deviance of the model with only the intercept term. The idea is that as we add variables to the model, the deviance decreases, providing a better fit. However, even a variable of pure random noise is expected to decrease the deviance by about one, so if we add p predictors to the model, the deviance should decrease by more than p before we call it an improvement.

3. Deviance in Python
Let's see how these concepts are reported by the summary function. Consider the well-switching example with distance100 as the explanatory variable. In the top right column of the model summary, we are given the log-likelihood and the deviance statistic.

4. Compute deviance
We can extract the null and residual deviance using the null_deviance and deviance attributes of the fitted model, respectively. Note that including the distance variable reduced the deviance by 41.86. This is far more than the expected reduction of one, so we can say that the distance variable improved the model fit. Using the formula for the deviance and the reported log-likelihood, we can also compute the deviance directly as negative two times the log-likelihood.

5. Model complexity
It is important to note that increasing the number of variables in the model to reduce the deviance may not provide a clear-cut path towards a better model fit. Say we have two models with likelihoods L1 and L2, where the likelihood of model 2 is higher. We might say that model 2 provides the "better" fit; however, we also need to take into account model complexity, that is, the number of parameters estimated in model 2 compared to model 1. It can happen that, when applied to new data, model 1 produces a better fit than model 2, indicating that model 2 is overfitting the training dataset and actually fits new data worse. In such situations, we say that model 2 does not generalize well to unseen data. If this occurs, we need to reduce model complexity to reduce overfitting and improve generalization.

6. Let's practice!
In the following exercises, you will practice model comparison in a multivariable setting.