1. Modeling and model selection
Let's look at the next analytical step: modeling and model selection.
2. Model specification
I estimate a logistic regression model using the glm() function from the stats package, setting the family argument to binomial.
The summary output gives you, among other information, the estimated coefficients, standard errors, test statistics, and p-values. It also displays the AIC value.
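As a minimal sketch, assuming a data frame called returnData with a binary outcome returned and predictors such as newsletter (these names are hypothetical, not taken from the lesson), the call looks like this:

```r
# Hypothetical data frame `returnData` with binary outcome `returned`;
# family = binomial is what makes glm() fit a logistic regression
logitModel <- glm(returned ~ ., family = binomial, data = returnData)

# Coefficients, standard errors, z statistics, p-values, and the AIC
summary(logitModel)
```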
Can you already see if there are any significant variables?
For the variable `newsletter`, the three stars in the last column indicate that the coefficient is highly significant. But let's look at significance in more detail!
3. Statistical significance
Now we'll explain how the hypothesis test actually works.
If the null hypothesis (also known as H0) is true, that is, `newsletter` does NOT influence whether a customer returns, the blue curve would be the true distribution of the coefficient estimate, and our data would give a z-value somewhere close to the middle of this curve.
With the data at hand, however, the z-value for `newsletter` lies in the tail of the blue distribution. This means that the probability of seeing a result at least this extreme, assuming `newsletter` has no effect, is less than 5%.
Therefore, we reject the null hypothesis and conclude that `newsletter` does have a significant effect on whether a customer returns. The true distribution of the coefficient estimate probably looks more like the orange curve.
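Behind those significance stars is a simple computation. Here's a minimal sketch using a made-up z-value, not one from the lesson:

```r
# Two-sided p-value for a made-up z-value under the standard normal
# distribution that the coefficient estimate follows under H0
z <- 3.2
p_value <- 2 * pnorm(-abs(z))  # probability mass in both tails
p_value                        # roughly 0.0014, well below 0.05
```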
4. Coefficient interpretation
Interpreting the coefficients is not straightforward. Without transformation, they give the effect on the log-odds, so I can only draw conclusions about the direction of the effect.
When I extract the coefficients with the help of the `coef()` function and apply the exponential function to remove the logarithm, however, I get the effect on the odds, which can be interpreted as follows: look at the variable newsletter1. Since e to the power of 0.52 is about 1.69, you can state that signing up for a newsletter increases the odds of returning to the online shop by a factor of 1.69, that is, by 69%, compared to somebody who is not subscribed to the newsletter.
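Continuing with the hypothetical logitModel from before, the transformation is a one-liner:

```r
# Exponentiate the log-odds coefficients to obtain odds ratios
exp(coef(logitModel))
# For newsletter1 this comes out around 1.69: subscribers have about
# 69% higher odds of returning than non-subscribers
```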
Now that you know how to interpret single coefficients, let's talk about different models!
5. Model selection
When building a model, you have to figure out which variables to include. One useful tool is the stepAIC() function from the MASS package. Starting from a full model, it iteratively compares several candidate models, dropping and adding variables depending on whether doing so lowers the AIC. The process continues as long as the AIC decreases and stops when a minimum is reached.
I've hidden the intermediate steps with trace = 0.
At the end of this process, you'll see a model with fewer explanatory variables and a lower, that is, better, AIC value.
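As a sketch, again starting from the hypothetical logitModel fitted above:

```r
library(MASS)

# Stepwise search over the predictors; trace = 0 suppresses the
# intermediate comparison tables
restrictedModel <- stepAIC(logitModel, trace = 0)

# The reduced model: fewer explanatory variables, lower AIC
summary(restrictedModel)
```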
6. Results of the step-AIC function
At first glance, the stepAIC() function already did a really good job of choosing relevant variables. The variables that were dropped are mostly unspecific ones like tvEquipment; within such wide categories you can expect a lot of noise.
Still, checking the model's plausibility from a content-driven point of view is very important.
Are there any variables not included in the model that you would have expected? Or some that do not make sense?
To keep this lesson simple, we focus only on this model. But in practice, never check only one model! More about that in the exercises.
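One simple way to compare candidates, sticking with the hypothetical models from above, is to put their AIC values side by side:

```r
# Compare candidate models; the lower AIC indicates the better
# trade-off between fit and complexity
AIC(logitModel, restrictedModel)
```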
7. Let's apply what I have shown you!
Go ahead and try it out!