ANOVA

1. ANOVA

In this video we'll formally introduce analysis of variance, or ANOVA for short. We're going to start our discussion with variability partitioning, which means considering the different factors that contribute to variability in our response variable.

2. Marathon finishing times

Consider runners in a marathon. Not everybody is going to finish at the same time. The variability in their finishing times will likely be due to a variety of factors. One of them might be how much they trained for this marathon. However, there will certainly be other factors as well: physical characteristics, previous running experience, sleep, warm up exercises, and so on and so forth. Suppose we're interested in evaluating how strongly training is associated with finishing time. In order to do so we partition the total variability in finishing times into variability due to training and variability due to all other factors. We're going to build on this idea of variability partitioning, and the F statistic we introduced earlier, to work our way through the analysis of variance output.

3. ANOVA for vocabulary scores vs. self identified social class

Let's quickly remind ourselves of the data we're working with from the General Social Survey on vocabulary scores, a numerical variable, and social class, a categorical variable with four levels. Our null hypothesis is that the average vocabulary score is the same across all social classes, and the alternative hypothesis is that average vocabulary scores differ between at least one pair of social classes.
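As a quick illustration of what this test looks like in code, here is a minimal sketch in Python. It assumes the survey responses live in a pandas data frame called gss, with a numerical column wordsum for the vocabulary score and a categorical column social_class with four levels; those names, and the file gss.csv, are hypothetical.

    import pandas as pd
    from scipy import stats

    # Hypothetical file holding the survey responses
    gss = pd.read_csv("gss.csv")

    # One array of vocabulary scores per social class
    groups = [g["wordsum"].dropna().values
              for _, g in gss.groupby("social_class")]

    # H0: all group means are equal; HA: at least one pair of means differs
    f_stat, p_value = stats.f_oneway(*groups)
    print(f_stat, p_value)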

4. Variability partitioning

Let's outline this idea of variability partitioning: The total variability in vocabulary scores is basically the variance in vocabulary scores of all respondents to the General Social Survey. We partition this variability into two pieces: variability that can be attributed to differences in social class, and variability attributed to all other factors. Variability attributed to social class is called "between group" variability, since social class is the grouping variable in our analysis. The other portion of the variability is what we're not interested in, and, in fact, it is somewhat of a nuisance factor for us, since if everyone within a certain social class scored the same, there would be no variability attributed to other factors. This portion of the variability is called "within group" variability.
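To make the partition concrete, here is a rough sketch that splits the total variability into its between group and within group pieces by hand, continuing with the same hypothetical gss data frame and column names as above.

    grand_mean = gss["wordsum"].mean()

    # Total sum of squares: squared deviations of every score from the grand mean
    sst = ((gss["wordsum"] - grand_mean) ** 2).sum()

    # Between group ("group") sum of squares: squared deviations of each group
    # mean from the grand mean, weighted by group size
    by_class = gss.groupby("social_class")["wordsum"].agg(["mean", "count"])
    ssg = (by_class["count"] * (by_class["mean"] - grand_mean) ** 2).sum()

    # Within group ("error") sum of squares: whatever is left over
    sse = sst - ssg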

5. ANOVA output

Here is a look at ANOVA output. The first row is about the between group variability, and the second row is about the within group variability. We often refer to the first row as the "group" row, and the second row as the "error" row. Next we'll go through some of the values on the ANOVA table and what they mean.
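One way to produce a table with this same structure in Python is an ordinary least squares fit followed by an ANOVA table from statsmodels. This is an illustration under the same assumptions as the earlier sketches, not the exact output shown on the slide.

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Fit vocabulary score against social class and summarize as an ANOVA table
    model = smf.ols("wordsum ~ C(social_class)", data=gss).fit()
    print(sm.stats.anova_lm(model))

    # The C(social_class) row plays the role of the "group" row (between groups)
    # and the Residual row plays the role of the "error" row (within groups).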

6. Sum of squares

Let's start with the Sum of Squares column. These values measure the variability attributed to the two components: the variability in vocabulary scores explained by social class, and the unexplained variability -- that is, unexplained by the explanatory variable in this particular analysis. The sum of these two values makes up the sum of squares total, which measures the total variability in the response variable, in this case the total variability of the vocabulary scores. This value is calculated very similarly to the variance, except that it's not scaled by the sample size. More specifically, it's calculated as the sum of the squared deviations of each observation from the mean of the response variable. One statistic not presented on the ANOVA table that might be of interest is the percentage of the variability in vocabulary scores explained by the social class variable. We can find this as the ratio of the sum of squares for class divided by the total sum of squares. In this case, 7.6% of the variability in vocabulary scores is explained by self identified social class. This value is the R-squared value we would obtain if we set up this analysis as a regression predicting vocabulary score from social class.
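Continuing the earlier sketch, the percent of variability explained comes straight from the two sums of squares, and matches the R-squared of the regression fit above; again, ssg, sst, and model are the hypothetical names introduced earlier.

    # Proportion of the variability in vocabulary scores explained by class
    explained = ssg / sst
    print(f"{explained:.1%} of the variability is explained by social class")

    # Same quantity as the R-squared of the regression of score on class
    print(model.rsquared)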

7. F-statistic

The main values of interest on this table are the F-statistic and the p-value. The F-statistic is calculated as the ratio of the "between" group variability to the "within" group variability; if the means of all groups were in fact equal, we would expect this ratio to be close to 1. The p-value is the area under the F-distribution beyond the observed F-statistic. We draw conclusions based on this p-value just like with any other hypothesis test we've seen so far.
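Putting the pieces together, here is how the F-statistic and p-value could be computed from the sums of squares in the earlier sketch: each sum of squares is scaled by its degrees of freedom to get a mean square, the F-statistic is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution beyond that statistic.

    from scipy import stats

    k = gss["social_class"].nunique()   # number of groups
    n = gss["wordsum"].count()          # number of non-missing scores

    df_group = k - 1                    # degrees of freedom, between groups
    df_error = n - k                    # degrees of freedom, within groups

    msg = ssg / df_group                # between group mean square
    mse = sse / df_error                # within group mean square

    f_stat = msg / mse
    p_value = stats.f.sf(f_stat, df_group, df_error)  # area beyond the F-statistic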

8. Let's practice!

Time to put this into practice.