1. Going beyond linear regression
Hi, my name is Ita, and welcome to this course on generalized linear models, or GLMs for short. GLMs provide a versatile framework for the statistical modeling of data and are often used to solve practical problems. We will examine several such problems.
2. Course objectives
The main objectives of this course are to learn the building blocks of GLMs: how to train them, interpret the model results, assess performance, and compute predictions. To accomplish these objectives, we will set the theoretical and computational foundations in chapter 1 and cover logistic and Poisson regression in the remaining chapters. By the end of the course, you will have both a theoretical understanding and a working knowledge of GLMs.
3. Review of linear models
GLMs are a generalization of linear models. To understand this, suppose you would like to predict salary given years of experience. In regression terms, you would write salary ~ experience, where the tilde means "predicted by". More formally, our linear model is written as follows:
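    y = beta_0 + beta_1 x + epsilon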
4. Review of linear models
where y is the continuous response variable,
5. Review of linear models
x the explanatory variable,
6. Review of linear models
the betas are fixed, unknown parameters that we estimate, where beta_0 denotes the intercept and beta_1 the slope,
7. Review of linear models
and epsilon is the random error term, which measures how much of the variation in the response is not explained by the explanatory variable.
8. ols() and glm()
To fit linear models in Python, we use statsmodels' ols function, which is imported from statsmodels dot formula dot api. Next, we initialize ols with the formula and data arguments: the formula specifies the output and inputs, and data is the dataset containing the variables. Finally, the model is fitted by calling the fit method. The glm function is quite similar. It is also imported directly, and it takes one additional argument, family, which denotes the probability distribution of the response variable. More on this in the next lessons.
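A minimal sketch of both calls, using a small made-up salary dataset (the column names and numbers below are illustrative, not the course data):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols, glm

    # Hypothetical salary data: years of experience and salary
    salaries = pd.DataFrame({
        "experience": [1, 3, 5, 7, 10],
        "salary": [40, 52, 61, 68, 80],
    })

    # ols(): the formula gives output ~ inputs, data holds the variables
    model_lm = ols(formula="salary ~ experience", data=salaries).fit()

    # glm(): the same pattern, plus the family argument for the
    # probability distribution of the response
    model_glm = glm(formula="salary ~ experience", data=salaries,
                    family=sm.families.Gaussian()).fit()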
9. Assumptions of linear models
Using the ols function, we obtain the linear fit. The regression function tells us how much the response variable y changes, on average, for a unit increase in x. The model assumptions are linearity in the parameters, independent and normally distributed errors, and constant variance around the regression line for all values of x.
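In symbols, the regression function is the average response at a given x, E(y | x) = beta_0 + beta_1 x, so beta_1 gives the average change in y per unit increase in x.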
10. What if ... ?
But what if the response is not continuous but binary, or a count? Or what if the variance depends on the mean? Can we still fit a linear model?
11. Dataset - nesting of horseshoe crabs
To illustrate this, let's consider data on the nesting of horseshoe crabs. The data has four explanatory variables and two response variables, sat and y.
12. Linear model and binary response
We are interested in predicting the probability that there is at least one satellite crab near the female crab, given the female's weight.
13. Linear model and binary response
The response variable is binary, denoting Yes, or 1, if a satellite is present, and No, or 0, otherwise.
14. Linear model and binary response
First, we fit a linear model using the ols function.
15. Linear model and binary response
Taking the weight at 5.2 and reading off the probability value, we see the fit is structurally wrong, since we get a value greater than 1, which is not possible for our data.
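As a sketch of these steps, here is the ols fit and prediction with stand-in data (the crab values below are made up for illustration; the course data produces the exact numbers on the slide):

    import pandas as pd
    from statsmodels.formula.api import ols

    # Stand-in for the crab data: female weight and binary satellite indicator
    crab = pd.DataFrame({
        "weight": [1.2, 1.5, 2.0, 2.4, 2.9, 3.3, 3.8, 4.2, 4.8, 5.2],
        "y":      [0,   0,   0,   1,   0,   1,   1,   1,   1,   1],
    })

    # Fit a linear model to the binary response
    model_lm = ols(formula="y ~ weight", data=crab).fit()

    # A straight line is unbounded, so the fitted "probability" at
    # weight 5.2 can fall outside [0, 1]
    print(model_lm.predict(pd.DataFrame({"weight": [5.2]})))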
16. Linear model and binary data
To correct for this, we fit a GLM, shown in blue, with the Binomial family, corresponding to binomial, or logistic, regression. Visually, there is a clear difference between the two fitted models. Let's see what this means numerically.
17. Linear model and binary data
Now, for the weight of 5.2, we obtain a probability of 0.99, which is in line with binary data, since it is bounded by 0 and 1.
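Continuing the sketch above (reusing the crab DataFrame and the pandas import), the Binomial-family GLM keeps predictions between 0 and 1:

    import statsmodels.api as sm
    from statsmodels.formula.api import glm

    # Logistic (Binomial) regression of satellite presence on weight
    model_glm = glm(formula="y ~ weight", data=crab,
                    family=sm.families.Binomial()).fit()

    # The predicted probability at weight 5.2 now lies between 0 and 1
    print(model_glm.predict(pd.DataFrame({"weight": [5.2]})))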
18. From probabilities to classes
To obtain the binary class from the computed probabilities, we split the probabilities at, say, 0.5, which for the weight of 5.2 gives the Yes class. Similarly, for the weight of 1.5 we obtain the No class.
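Continuing the same sketch, a 0.5 cutoff turns the predicted probabilities into classes:

    # Probabilities for two new weights, then a 0.5 cutoff for the class
    new_data = pd.DataFrame({"weight": [5.2, 1.5]})
    probs = model_glm.predict(new_data)
    classes = (probs > 0.5).astype(int)  # 1 = Yes, 0 = No
    print(probs)
    print(classes)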
19. Let's practice!
Now let's review these concepts in the exercises.