Anova and types of variability
The reason why an analysis of variance is called an analysis of variance is that we compare two different types of variability. Essentially we test the ratio of between group and within group variability. Let's consider visually what this means. To do so, we use a subset of data from the 1 million song dataset retrieved from the following link: "http://labrosa.ee.columbia.edu/millionsong/blog/11-2-28-deriving-genre-dataset"
On the left most visualization, the big red dots represents the group averages. Given the example we use in this lab, this represents the average song duration per genre. As you can see, there are some differences in the average song duration per genre. You can consider this to be the between group variability. On the right most visualizaiton, the red lines represent the variability in song duration within each genre. As you see, for this sample of songs, the song duration of classical songs range from approximately 100 seconds to a little over 700 seconds. This variability can be considered the within group variability.
The larger the variability between groups relative to the variability within groups, the larger our F statistic becomes. Before we continue to calculate our F statistic, let's first calculate some useful summary statistics. Make sure to round each number to two decimals. You can do so using the round function: round(mean(data$duration), digits = 2)
.
This exercise is part of the course
Inferential Statistics
Exercise instructions
- Calculate the overall average song duration and store it in a variable called
grand_mean
. The overall data is available in a dataframe calledsong_data
- Calculate the average song duration for the classical genre and store it in a variable called
classical_average
. The classical song data is available in a dataframe calledclassical_data
. - Calculate the average song duration for the hip hop genre and store it in a variable called
hiphop_average
. The hip hop song data is available in a dataframe calledhiphop_data
. - Calculate the average song duration for the pop genre and store it in a variable called
pop_average
. The pop song data is available in a dataframe calledpop_data
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# calculate the overall mean and store it in a variable called grand_mean
# calculate the mean song duration of the classical songs and store it in a variable
#' called classical_average
# calculate the mean song duration of the hip hop songs and store it in a variable
#' called hiphop_average
# calculate the mean song duration of the pop songs and store it in a variable
#' called pop_average