1. Crash course on GLMs
So far, we have covered linear mixed models. These models make important assumptions about the data and their predictor variables. In this tutorial, we will see how to relax one of these assumptions: the assumption of normality.
We will cover "generalized" linear models during this video. Later in this chapter, we will cover generalized linear mixed models.
2. Assumption of normality
Linear mixed models, like linear models, assume normality. More precisely, these models assume that the residuals of the model are normally distributed. However, data often do not meet this assumption. Historically, one solution was to transform the data. For example, with proportion data, 10% of respondents might answer yes to a question, and these proportions would be transformed using an arcsine transformation. However, advances in modeling now allow us to model the raw data directly. Hence, a recently published ecology article proclaims that the "arcsine is asinine". For example, R can now readily model non-normal data, such as counts using Poisson distributions and proportions using binomial distributions.
3. R syntax for GLM
The generalized linear model or glm() function uses the same formula notation as the linear model function. Unlike the lm() function, the glm() function in R allows different "families" of distributions to be fit. A family specifies the error distribution and the link function that connects the model's predictions to the observed data. The default glm family is the Gaussian, or normal, distribution, so a glm() with a Gaussian family is the same as lm(). All families in base R are listed in the ?family help file.
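As a minimal sketch (with simulated data, so the variable names here are illustrative), we can confirm that glm() with the default Gaussian family reproduces lm():

```r
# Simulated data: y is a linear function of x plus normal noise
set.seed(42)
dat <- data.frame(x = 1:20)
dat$y <- 2 + 0.5 * dat$x + rnorm(20)

# lm() and glm() with the default gaussian family fit the same model
fit_lm  <- lm(y ~ x, data = dat)
fit_glm <- glm(y ~ x, data = dat, family = gaussian())

# The coefficient estimates match
all.equal(coef(fit_lm), coef(fit_glm))
```

Swapping `gaussian()` for another family, such as `poisson` or `binomial`, is all that changes in the call.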
4. The Poisson distribution
When dealing with count data, the first distribution I consider is the Poisson distribution. This distribution models data where a certain number of events occur per unit area or time. For example, we could use a Poisson error term to model the number of visitors to a website per hour. The example Poisson distribution plotted here has a mean of 3. Notice how the distribution requires discrete values that are non-negative. The distribution also assumes the mean is equal to the variance. The Poisson distribution works well for small counts, roughly less than 30. For larger counts, another distribution, such as the normal, will usually be more appropriate.
5. Example with Poisson regression
To fit a glm() with a Poisson error term, one simply specifies the poisson family. This model is also called a Poisson regression. You'll get a chance to fit this model during the exercise.
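A minimal sketch of a Poisson regression, using simulated website-visitor counts (the data and variable names are hypothetical, not from the exercise):

```r
# Simulated counts: visitors per hour, generated from a Poisson model
set.seed(1)
site <- data.frame(hour = 1:100)
site$visitors <- rpois(100, lambda = exp(0.5 + 0.02 * site$hour))

# Poisson regression: specify the poisson family in glm()
fit_pois <- glm(visitors ~ hour, data = site, family = poisson)
summary(fit_pois)

# Coefficients are on the log scale (the default log link);
# exponentiate to interpret them as multiplicative effects
exp(coef(fit_pois))
```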
6. Logistic regression
Oftentimes, I deal with binary data, where each outcome is either a zero or a one. For example, we might be interested in the probability of an answer of yes, indicated by a one, or no, indicated by a zero, to a question. A logistic regression, or more broadly, a binomial regression, allows these outcomes to be modeled.
7. Example with logistic regression
To fit a logistic regression in R, the glm() function is used with a binomial error term. glm() accepts binomial data in three formats: a "binary" format with one 0/1 outcome per row, the "Wilkinson-Rogers" format with a two-column matrix of successes and failures, or a weighted format with proportions and group sizes as weights. All three approaches produce the same coefficient estimates but differ in the degrees of freedom and resulting deviance values. The last two methods treat each group as a single observation rather than counting each individual trial, so the data are considered to have fewer observations and, therefore, fewer degrees of freedom.
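A sketch of the three input formats, using simulated yes/no answers for two hypothetical treatment groups (all names here are illustrative):

```r
# Simulated binary answers (1 = yes, 0 = no) for two treatment groups
set.seed(2)
d <- data.frame(
  treatment = rep(c("a", "b"), each = 50),
  answer    = rbinom(100, size = 1, prob = rep(c(0.3, 0.6), each = 50))
)

# 1. Binary format: one row per respondent, outcome is 0/1
fit_binary <- glm(answer ~ treatment, data = d, family = binomial)

# 2. Wilkinson-Rogers format: cbind(successes, failures) per group
yes <- tapply(d$answer, d$treatment, sum)
n   <- tapply(d$answer, d$treatment, length)
agg <- data.frame(treatment = names(yes), yes = as.vector(yes),
                  no = as.vector(n - yes))
fit_wr <- glm(cbind(yes, no) ~ treatment, data = agg, family = binomial)

# 3. Weighted format: proportion of successes, weighted by group size
agg$n    <- agg$yes + agg$no
agg$prop <- agg$yes / agg$n
fit_wt <- glm(prop ~ treatment, data = agg, family = binomial, weights = n)

# All three produce the same coefficient estimates,
# but the aggregated fits have fewer residual degrees of freedom
coef(fit_binary)
coef(fit_wr)
coef(fit_wt)
```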
You will get to see all three methods in the exercise.
8. Let's practice!
Now, you get to try generalized linear models yourself!