Linear model and a binary response variable
In the video, you saw an example of fitting a linear model to a binary response variable and how things can quickly go wrong. You learned that, given the fitted line, you can obtain fitted values \(\hat{y}\) that fall outside the interval [0, 1], which is inconsistent with a response variable that only takes on the values 0 and 1.
Using the preloaded crab dataset, you will study this effect by modeling y as a function of x using the GLM framework.
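As a quick illustration (a minimal sketch on simulated data, not the crab dataset), the following fits a Gaussian GLM to a 0/1 outcome and shows that the fitted values can fall outside [0, 1] at the extremes of x:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

# Simulate a binary response that switches from 0 to 1 as x increases
rng = np.random.default_rng(42)
x = rng.uniform(20, 34, size=100)
y = (x + rng.normal(scale=2, size=100) > 26).astype(int)
sim = pd.DataFrame({'y': y, 'x': x})

# Fit a linear model (Gaussian family) to the 0/1 response
lm_fit = glm('y ~ x', data=sim, family=sm.families.Gaussian()).fit()

# With a linear fit, fitted values spill outside the [0, 1] range of y
print(lm_fit.fittedvalues.min(), lm_fit.fittedvalues.max())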
Recall that the GLM model formulation is:
glm(formula = 'y ~ X', data = my_data, family = sm.families.____).fit()
where you specify formula, data, and family.
Also, recall that a GLM with:
- the Gaussian family is a linear model (a special case of GLMs)
- the Binomial family is a logistic regression model.
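In statsmodels (with import statsmodels.api as sm), those two family choices look like this; their default links give an ordinary linear model and a logistic regression, respectively:

# Gaussian family, identity link by default: ordinary linear regression
family_lm = sm.families.Gaussian()

# Binomial family, logit link by default: logistic regression
family_glm = sm.families.Binomial()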
Exercise instructions
- Using the crab dataset, define the model formula so that y is predicted by width.
- To fit a linear model using the GLM formula, use Gaussian() for the family argument, which assumes y is continuous and approximately normally distributed.
- To fit a logistic model using the GLM formula, use Binomial() for the family argument.
- Fit a model using glm() with appropriate arguments and use print() and summary() to view summaries of the fitted models.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define model formula
formula = '____ ~ ____'
# Define probability distribution for the response variable for
# the linear (LM) and logistic (GLM) model
family_LM = sm.families.____
family_GLM = sm.families.____
# Define and fit a linear regression model
model_LM = glm(formula = ____, data = ____, family = ____).fit()
print(____.____)
# Define and fit a logistic regression model
model_GLM = glm(formula = ____, data = ____, family = ____).fit()
print(____.____)
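For reference, here is one possible completed version of the template, assuming the preloaded crab DataFrame exposes the columns y and width (a sketch, not the official solution):

# glm and sm are typically preloaded in the exercise; imported here for completeness
import statsmodels.api as sm
from statsmodels.formula.api import glm

# Define model formula: predict y from width
formula = 'y ~ width'

# Define probability distribution for the response variable for
# the linear (LM) and logistic (GLM) model
family_LM = sm.families.Gaussian()
family_GLM = sm.families.Binomial()

# Define and fit a linear regression model
model_LM = glm(formula = formula, data = crab, family = family_LM).fit()
print(model_LM.summary())

# Define and fit a logistic regression model
model_GLM = glm(formula = formula, data = crab, family = family_GLM).fit()
print(model_GLM.summary())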