Anova and types of variability

The reason why an analysis of variance is called an analysis of variance is that we compare two different types of variability. Essentially we test the ratio of between group and within group variability. Let's consider visually what this means. To do so, we use a subset of data from the 1 million song dataset retrieved from the following link: "http://labrosa.ee.columbia.edu/millionsong/blog/11-2-28-deriving-genre-dataset"

On the left most visualization, the big red dots represents the group averages. Given the example we use in this lab, this represents the average song duration per genre. As you can see, there are some differences in the average song duration per genre. You can consider this to be the between group variability. On the right most visualizaiton, the red lines represent the variability in song duration within each genre. As you see, for this sample of songs, the song duration of classical songs range from approximately 100 seconds to a little over 700 seconds. This variability can be considered the within group variability.

The larger the variability between groups relative to the variability within groups, the larger our F statistic becomes. Before we continue to calculate our F statistic, let's first calculate some useful summary statistics. Make sure to round each number to two decimals. You can do so using the round function: round(mean(data$duration), digits = 2).

This exercise is part of the course

Inferential Statistics

View Course

Exercise instructions

Calculate the overall average song duration and store it in a variable called grand_mean. The overall data is available in a dataframe called song_data
Calculate the average song duration for the classical genre and store it in a variable called classical_average. The classical song data is available in a dataframe called classical_data.
Calculate the average song duration for the hip hop genre and store it in a variable called hiphop_average. The hip hop song data is available in a dataframe called hiphop_data.
Calculate the average song duration for the pop genre and store it in a variable called pop_average. The pop song data is available in a dataframe called pop_data.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# calculate the overall mean and store it in a variable called grand_mean


# calculate the mean song duration of the classical songs and store it in a variable
#' called classical_average


# calculate the mean song duration of the hip hop songs and store it in a variable
#' called hiphop_average


# calculate the mean song duration of the pop songs and store it in a variable
#' called pop_average

Edit and Run Code