1. Technical conditions for linear regression
In the previous chapter you saw that sometimes the mathematical model was not appropriate for inferential analysis (that is, for calculating p-values and confidence intervals). In this chapter, we'll provide details for when the mathematical model is appropriate.
2. What are the technical conditions?
Because your goal in this course is to perform inferential calculations on the linear regression model, it is important that the sampling distribution of the estimated slope has the expected form. That is, the methods apply only if the relationship between the variables is linear, the observations are independent, the points are normally distributed around the line, and the variability around the line is the same for every value of the explanatory variable. Note that the conditions are given by the linear model equation as well as spelled out using the LINE mnemonic (Linear, Independent, Normal, Equal variability).
If the sampling distribution does not have the expected form, the p-values and confidence intervals that you calculate could be wrong.
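The linear model equation mentioned above appears on the slide and is not reproduced in this transcript. As a sketch, assuming the standard simple linear regression notation (the symbol \sigma_\epsilon and the variance parameterization are assumptions here), the population model and the way each LINE condition reads off it can be written as:

```latex
% Population model (Greek letters); assumed standard form
Y = \beta_0 + \beta_1 X + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2)

% L: the mean of Y is a linear function of X
E[Y \mid X] = \beta_0 + \beta_1 X
% I: the errors \epsilon are independent of one another
% N: the errors \epsilon are normally distributed around the line
% E: \sigma_\epsilon is the same for every value of X

% Fitted model (Roman letters) and the residual for observation i
\hat{y}_i = b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i
```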
3. Linear model: residuals
The `augment` function in the `broom` package calculates the fitted value and the residual for every point in the dataset. By default, its output stores them in columns named `.fitted` and `.resid`.
If the linear model is appropriate, a plot of the residuals versus the fitted values should show a non-patterned scattering of the points. The fitted model is usually described by Roman letters (b0 and b1), whereas the population model we want to estimate is described by Greek letters (beta0 and beta1).
The residual plot here (fitted values on the x-axis, residuals on the y-axis) shows a scattering of points that does not indicate any violation of the regression technical conditions.
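To make the residual plot concrete, here is a minimal sketch in R. The data are simulated so that all of the conditions hold; the seed, sample size, coefficients, and the names `sim_data`, `mod`, and `mod_aug` are hypothetical choices for illustration, not values from the course.

```r
library(broom)

set.seed(42)                            # hypothetical seed, for reproducibility only
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)     # linear trend, normal errors, constant spread
sim_data <- data.frame(x = x, y = y)

mod <- lm(y ~ x, data = sim_data)
mod_aug <- augment(mod)                 # one row per observation, adds .fitted and .resid

head(mod_aug[, c("x", "y", ".fitted", ".resid")])

# Residual plot: fitted values on the x-axis, residuals on the y-axis
plot(mod_aug$.fitted, mod_aug$.resid,
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)                  # reference line at zero
```

With data like these, the points should scatter around the dashed line with no visible pattern.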
4. Not linear
The plot here demonstrates a clear violation of the linear model. The variables have a quadratic relationship, not a linear one!
5. Not linear: residuals
The residuals from fitting a line to the quadratic data also look curved. For the technical conditions to hold, you need a non-patterned scattering of points. Just like the original scatter plot, the residual plot (fitted values on the x-axis, residuals on the y-axis) continues to demonstrate a violation of the linearity condition.
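You can reproduce this kind of violation yourself. The sketch below simulates a quadratic relationship and fits a straight line to it anyway; the coefficients, noise level, and the names `quad_data` and `quad_aug` are assumptions made for illustration.

```r
library(broom)

set.seed(1)                                   # hypothetical seed
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x^2 + rnorm(n, sd = 2)         # the true relationship is quadratic
quad_data <- data.frame(x = x, y = y)

# Fit a straight line even though the relationship is curved
quad_aug <- augment(lm(y ~ x, data = quad_data))

plot(quad_aug$.fitted, quad_aug$.resid,
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Average residual at the low end and in the middle of the x range
mean(quad_aug$.resid[quad_aug$x < 1])
mean(quad_aug$.resid[quad_aug$x > 4 & quad_aug$x < 6])
```

The residuals should sit above zero at both ends of the plot and below zero in the middle, which is the curved pattern described above.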
6. Not normal
The violation here is not as obvious as the non-linear violation. In this plot, the points are not normally distributed around the line. That is, although the residuals are centered at zero, the points under the line are closer to the line and the points above the line are scattered farther from the line.
7. Not normal: residuals
The residual plot makes it even easier to see the violation of the technical condition related to normality. If the residuals were normally distributed around the line, they would be spread symmetrically, reaching about equally far from the line in the positive and negative directions. Here, the points below the line do not spread out nearly as far as the points above the line.
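Here is a small simulated sketch of the same idea. The errors are drawn from a shifted exponential distribution, which is one assumed way to get errors that are centered at zero but right-skewed; the course plot may have been generated differently, and the names `skew_data` and `skew_aug` are hypothetical.

```r
library(broom)

set.seed(7)                              # hypothetical seed
n <- 200
x <- runif(n, min = 0, max = 10)
err <- rexp(n, rate = 1) - 1             # mean zero, but skewed to the right
y <- 3 + 2 * x + err
skew_data <- data.frame(x = x, y = y)

skew_aug <- augment(lm(y ~ x, data = skew_data))

plot(skew_aug$.fitted, skew_aug$.resid,
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Smallest and largest residual: the positive side should reach much farther
range(skew_aug$.resid)
```

The negative residuals stay close to the zero line, while the positive residuals spread much farther from it.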
8. Not equal variance
The last violation we will investigate is unequal variability across different values of the explanatory variable. In this plot, it seems as though the Y values associated with small values of X are quite close to the line, whereas the Y values associated with large values of X have much larger variability around the line.
9. Not equal variance: residuals
Once again, the residual plot accentuates the technical condition violation by demonstrating the increasing variability of the residuals around the line as the fitted value increases.
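As a sketch of this last violation, the simulation below lets the standard deviation of the errors grow with x. The specific form (`sd = 0.5 * x`) and the names `var_data`, `var_aug`, and `upper` are assumptions chosen for illustration.

```r
library(broom)

set.seed(3)                                      # hypothetical seed
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 0.5 * x)   # spread grows with x
var_data <- data.frame(x = x, y = y)

var_aug <- augment(lm(y ~ x, data = var_data))

plot(var_aug$.fitted, var_aug$.resid,
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Compare residual spread for the lower and upper half of the fitted values
upper <- var_aug$.fitted > median(var_aug$.fitted)
sd(var_aug$.resid[!upper])
sd(var_aug$.resid[upper])
```

The residual plot should fan out from left to right, and the second standard deviation should be noticeably larger than the first.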
10. Let's practice!
As mentioned previously, meeting the technical conditions helps to ensure that your p-values and confidence intervals accurately reflect the population. Soon, you will practice transforming data so that it meets the conditions. But for now, it is your turn to practice determining when the technical conditions have been met.