1. Model validation
In the last lesson, we built a few linear models with the lm() and aov() functions plus an anova table with the anova() function. We also did some pre-modeling EDA, though we skipped something pretty crucial that we'll discuss now.
2. Pre-modeling EDA
Before modeling you should do some EDA of your data, as in the last lesson. Let's say Lending Club asked you, one of their data scientists, to examine the funded amount of the loan based on verification_status. verification_status is a variable that indicates if the applicant's reported income was somehow verified by Lending Club themselves, verified by another source, or not verified.
We looked at the median and variance in the last lesson, with dplyr code that looked like this:
We didn't group by the purpose variable because it hadn't been recoded yet. Running the second block of code gives us the median and variance of the funded amount.
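The code from the slide isn't reproduced in this transcript, but the summary computation might look something like this, a sketch assuming the loan data lives in a data frame called lending_clean with columns verification_status and funded_amnt (names are illustrative):

```r
library(dplyr)

# Median and variance of funded amount, by verification status
# (data frame and column names are assumed for illustration)
lending_clean %>%
  group_by(verification_status) %>%
  summarize(
    median_funded = median(funded_amnt, na.rm = TRUE),
    var_funded    = var(funded_amnt, na.rm = TRUE)
  )
```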
3. Pre-modeling EDA continued
There's more, however! While a boxplot isn't the kind of graph that non-data scientists always respond well to, it's often worth building one for yourself to see the interquartile range and spread of the variable. We can accomplish this with ggplot2 code.
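The ggplot2 code from the slide isn't shown here; a minimal sketch, again assuming a lending_clean data frame with verification_status and funded_amnt columns, would be:

```r
library(ggplot2)

# Boxplot of funded amount by verification status
# (data frame and column names are assumed for illustration)
ggplot(lending_clean, aes(x = verification_status, y = funded_amnt)) +
  geom_boxplot()
```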
4. Boxplot
The boxplot here shows no alarming outliers, though there are a few extreme observations in the "not verified" category, represented by the dots in the upper left. We also see that "source verified" and "verified" have very similar distributions. This is good news, and you can continue on to modeling.
5. Post-modeling model validation
Let's skip ahead a bit and say that you built the ANOVA model for funded amount by verification status and found that the mean funded amounts for the different verification statuses are significantly different. Furthermore, you ran Tukey's HSD test and found that the only pair that is not significantly different is Verified versus Source verified. Now what?
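In code, that modeling step might look like the following sketch, assuming the same illustrative lending_clean data frame and column names as before:

```r
# One-way ANOVA of funded amount by verification status
# (data frame and column names are assumed for illustration)
loan_aov <- aov(funded_amnt ~ verification_status, data = lending_clean)
summary(loan_aov)

# Tukey's HSD: pairwise comparisons between verification statuses,
# with p-values adjusted for multiple testing
TukeyHSD(loan_aov)
```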
Now comes post-modeling model validation. This can include looking at different plots, such as a residuals versus fitted values plot or a Q-Q plot; testing ANOVA assumptions, such as the homogeneity, or sameness, of variances; or even trying non-parametric alternatives to ANOVA, such as the Kruskal-Wallis test. Non-parametric just means that the test does not assume the data came from a particular statistical distribution, the way ANOVA assumes the data are normally distributed.
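Two of those checks have one-line equivalents in base R. As a sketch, still assuming the illustrative lending_clean data frame:

```r
# Test the homogeneity-of-variances assumption across groups
# (data frame and column names are assumed for illustration)
bartlett.test(funded_amnt ~ verification_status, data = lending_clean)

# Kruskal-Wallis test: a non-parametric alternative to one-way ANOVA
kruskal.test(funded_amnt ~ verification_status, data = lending_clean)
```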
6. Post-model validation plots
The residuals versus fitted values plot suggests a good fit when the scatter pattern is similar for each level of the grouping variable, which is what we saw in this plot. If we saw a different pattern for each level, we could begin to suspect heteroscedasticity in the residuals, and that the model may not be a great fit.
The Normal Q-Q plot should, ideally, show the points falling along the reference line. One assumption of ANOVA and linear models is that the residuals are normally distributed. If that proves not to be true, your model might not be a good fit, and you may need to try adding explanatory variables or try different modeling techniques.
The other two graphs are discussed less often, but they also have interpretations relevant to your model. In the Scale-Location plot, a good fit shows as a roughly horizontal trend with an even spread of residuals across the fitted values; a clear upward trend would again point to heteroscedasticity. The Residuals versus Leverage plot highlights influential observations: points with high leverage and large residuals can pull the fit toward themselves and deserve a closer look.
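All four diagnostic plots discussed above come from plotting the fitted model object. A sketch, using the same assumed data frame and column names as before:

```r
# Fit the model, then draw its four diagnostic plots:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
# (data frame and column names are assumed for illustration)
loan_aov <- aov(funded_amnt ~ verification_status, data = lending_clean)

par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(loan_aov)
par(mfrow = c(1, 1))  # reset the plotting layout
```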
7. Let's practice!
Let's dive back in and do all of the validation steps for the model we already built, now that we know how to use the ANOVA-related functions.