1. Measures of center
What do we mean by a typical observation? For example, it sounds perfectly fine to state a statistic like: the typical life expectancy in the US is 77 (point) 6 years, but where does that number come from? Before we answer that question, let's make this more concrete by introducing a dataset that we'll be working with throughout the chapter.
2. County demographics
Researchers in public health have compiled data on the demographics of every county in the US. We see here that we have 4 variables: the state name, the county name, then the average life expectancy, and the median income. Let's focus on the life expectancy in the first 11 counties in this dataset, all in the state of Alabama.
3. Center: mean
I'm going to simplify and extract those 11 numbers by rounding the values of expectancy, when looking at the first 11 cases. The result, x, is 11 integers, all in the mid 70s.
OK, so let's ask the question again: what is a typical value for this set of 11 numbers?
4. Center: mean
The most common answer is the mean, which is the sum of all of the observations divided by the number of observations. We learn that the mean life expectancy in these 11 counties is around 74 (point) 5 years. We can also use the built-in function mean.
If we visualize "x" as a dot plot,
5. Center: mean
we can represent the mean as a vertical red line.
Another measure of "typical" or "center" is the median.
6. Center: mean, median
The median is the middle value in the sorted dataset. So if we sort x, the middle value is 74. We can also use the built-in function median. Let's draw that line in blue.
7. Center: mean, median
A third measure of center is the mode.
8. Center: mean, median, mode
The mode is simply the most common observation in the set. We can look at the dot plot and see that it is 74. We can also run the table function to see that the greatest count was 3 at 74. Let's plot the mode right next to the median in gold.
9. Center: mean, median, mode
The mode is the highest point on a plot of the distribution, while the median divides the dataset into a lower half and an upper half. In this case, those values are the same, but that is often not the case. The mean can be thought of as the balance point of the data and it tends to be drawn towards the longer tail of a distribution. This highlights an important feature of the mean: its sensitivity to extreme values. For this reason, when working with skewed distributions, the median is often a more appropriate measure of center.
Now that we have some sensible measures of center, we can answer questions like: is the typical county life expectancy on the West Coast similar to the typical life expectancy in the rest of the country?
10. Groupwise means
To answer this, we start by creating a new variable that will be TRUE if the state value is one of "California", "Oregon", or "Washington", and FALSE otherwise, and save it back to the original dataset. To compute groupwise means, we pipe the dataset into the group by function indicating that we'd like to establish groups based on our new variable. Then we can summarize those groups, West Coast counties and non- West Coast counties, by taking the mean and median of their life expectancies. We learn that looking at both mean and median, the typical West Coast county has a slightly higher average life expectancy than counties not on the West Coast.
11. Without group_by()
Group by and summarize form a powerful pair of functions, so let's look into how they work. Let's look at a slice of 8 rows in the middle of the dataset and remove the group by line. This will summarize the expectancy variable by taking its mean across all 8 rows.
12. With group_by()
If we add a line to group by West Coast, it's effectively breaking the dataset into two groups and calculating the mean of the expectancy variable for each one separately.
group by and summarize open up lots of possibilities for analysis, so let's get started.
13. Let's practice!
In all of the exercises for this chapter, you'll be working with similar data, but on a global scale, in the Gapminder data.