Normality tests
1. Normality tests
Normal distributions are foundational in statistics, but must be used with care. Assumptions of normality show up in all sorts of tests that we will see in this course, so it's important to understand how to assess normality.

2. Height of US males
Normal distributions show up everywhere, from the birth weights of babies to ACT scores. Here we see the heights of adult males in the United States. Since normal distributions show up so often, it makes sense to check whether our data is normally distributed. A powerful suite of statistical tools, including several common hypothesis tests, depends on the assumption that the underlying data is normally distributed.

3. Model residuals
One example of a normal distribution can be seen when comparing model predictions to actual values. Consider the plot shown, which models a linear relationship between years of employment and salary for police officers in the City of Austin, TX. If this line is a good fit, we expect points to be distributed evenly above and below it. Indeed, one of the assumptions of linear models is that data points are normally distributed about the prediction.

4. Model residuals
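As a minimal sketch of this residual check, the snippet below fits a line with NumPy and computes prediction-minus-actual residuals. The employment and salary numbers are simulated stand-ins, not the actual City of Austin data.

```python
# Sketch of a residual check for a linear fit.
# The years/salary values are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
years = rng.uniform(0, 25, size=200)  # hypothetical years of employment
# Skewed (gamma) noise so the residuals will NOT look normal
salaries = 45_000 + 2_000 * years + rng.gamma(2.0, 4_000, size=200)

# Least-squares line: salary ~ years
slope, intercept = np.polyfit(years, salaries, deg=1)
predicted = slope * years + intercept

# Residuals: prediction minus actual, as described in the video
residuals = predicted - salaries

# If the line fits well, a histogram of residuals should look
# roughly normal and centered on zero
counts, edges = np.histogram(residuals, bins=20)
print(counts)
```

A histogram (for example `plt.hist(residuals)`) is the usual way to eyeball whether these residuals look normal.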
To check whether our assumption holds, we subtract the actual salary from the line's prediction to find the "residuals", or "errors". When we plot the residuals, we find that they are not at all normally distributed. What seemed at first glance like a line that fit the data well turned out to be anything but.

5. Applications of normal distributions
Later we will discuss parametric tests. A popular example is the t-test, used to compare means. A t-test carries the underlying assumption that the sample means follow a normal distribution; if that is not the case, its results will be invalid.

6. Histogram of salaries
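Here is a small sketch of how salary-like data can be binned for a histogram in code. The values are simulated, not the real salaries; `np.histogram` does the same binning that a plotting call like `plt.hist` renders visually.

```python
# Sketch: bin simulated, right-skewed salary-like data,
# the kind of shape that can masquerade as "close to normal".
import numpy as np

rng = np.random.default_rng(3)
salaries = rng.gamma(shape=2.0, scale=25_000, size=500)  # hypothetical data

counts, edges = np.histogram(salaries, bins=20)
# Right skew shows up as large counts in the early bins
print(counts)
```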
Let's look at the salary data we just discussed. It looks somewhat close to normal, but it's not obvious whether it is. One way to test whether data was sampled from a normal distribution is the Anderson-Darling test of normality.

7. Anderson-Darling test for normality
The Anderson-Darling test evaluates the null hypothesis that our sample was drawn from a normal distribution. Let's apply this test to the salary data we just saw.

8. Anderson-Darling test in SciPy
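A hedged sketch of running this test in SciPy is below. Simulated, deliberately non-normal data stands in for the real salaries; the `stats.anderson` call and the attributes of its result object are real SciPy API.

```python
# Sketch of the Anderson-Darling normality test in SciPy.
# `salaries` is simulated stand-in data, not the real Austin salaries.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
salaries = rng.gamma(shape=2.0, scale=20_000, size=1_000)  # clearly skewed

result = stats.anderson(salaries, dist='norm')

# Compare the test statistic to each critical value; a statistic
# larger than the critical value rejects normality at that level
print(result.statistic)
for crit, sig in zip(result.critical_values, result.significance_level):
    if result.statistic > crit:
        print(f"Reject normality at the {sig}% level")
```

With skewed data like this, the statistic exceeds every critical value, so normality is rejected at all tested levels.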
The Anderson-Darling test is done in SciPy using the stats-dot-anderson function. It takes our sample values and returns a result object. SciPy returns an object rather than just a p-value because it simultaneously tests for normality at the 15 percent level all the way down to the 1 percent level. We conduct the test by comparing the test statistic, given by result-dot-statistic, to each of the critical values. If the test statistic is greater than a critical value, we reject the null hypothesis that the data is normally distributed at the corresponding significance level, which we can read from result-dot-significance-level. At all levels tested we reject the null, and thus conclude the data is not normally distributed. This matters especially for t-tests and ANOVA tests, which assume our data is normally distributed; if that assumption fails, their results are invalid. We'll dive into this more in later videos.

9. Fitting a normal distribution
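As a sketch of why this matters when fitting: the snippet below fits a normal distribution by its mean and standard deviation, then compares the normal model's estimate of the share of salaries under seventy thousand dollars with the empirical share. The data is simulated and skewed, so the two can disagree, just as in the lesson.

```python
# Sketch: fit a normal by mean/std, then compare its estimate of
# P(salary < 70,000) to the empirical proportion.
# Simulated skewed data stands in for the real salaries.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
salaries = rng.gamma(shape=4.0, scale=20_000, size=1_000)

mu, sigma = salaries.mean(), salaries.std()

# Estimate under the (possibly wrong) normality assumption
normal_share = stats.norm.cdf(70_000, loc=mu, scale=sigma)

# "By hand": the actual fraction of simulated salaries below 70,000
empirical_share = (salaries < 70_000).mean()

print(normal_share, empirical_share)
```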
The normality check we just ran is incredibly important! If normality were satisfied, we could use the stats-dot-norm-dot-cdf function, supplying the mean and standard deviation, to find the percent of officers with a salary under seventy thousand dollars. This gives an estimate of 27 percent of all officers. However, if we do the computation by hand, we see that only 20 percent of officers actually had a salary under seventy thousand dollars! So the inference that assumed normality was incorrect.

10. Let's practice!
Now it's time to take these ideas and apply them to make better data-driven decisions.