1. Introduction to summary statistics: The sample mean and median
We have seen that histograms, bee swarm plots, and ECDFs provide effective summaries of data. But we often would like to summarize data even more succinctly, say in one or two numbers. These numerical summaries are not by any stretch a substitute for the graphical methods we have been employing, but they do take up a lot less real estate.
2. 2008 US swing state election results
Let's go back to the election data from the swing states again. If we could summarize the percentage of the votes for Obama at the county level in Pennsylvania in one number, what would we choose? The first number that pops into my mind is
3. 2008 US swing state election results
the mean. The mean for a given state is just the average percentage of votes over the counties. If we add the means as horizontal lines to the bee swarm plot, we see that they are a reasonable summary of the data.
4. Mean vote percentage
To compute the mean of a set of data, we use the np.mean function, here used to compute the mean county-level vote for Obama in Pennsylvania. To put it precisely, the mean, written here as x-bar, is the sum of all the data, divided by the number n of data points.
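To make that definition concrete, here is a minimal sketch. The array name dem_share and its values are made up for illustration and are not the actual Pennsylvania data. It checks that np.mean gives the same answer as summing the data and dividing by n.

```python
import numpy as np

# Hypothetical county-level vote percentages (made-up values, not the real data)
dem_share = np.array([60.1, 43.2, 48.6, 39.9, 54.4])

# Mean computed by NumPy
mean_np = np.mean(dem_share)

# Mean computed from the definition: sum of the data divided by n
mean_manual = np.sum(dem_share) / len(dem_share)

print(mean_np, mean_manual)
```

Both numbers agree up to floating-point rounding, which is all np.mean is doing under the hood for a one-dimensional array.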
Now, the mean is a useful statistic and easy to calculate, but a major problem is that it is heavily influenced
5. Outliers
by outliers, or data points whose value is far greater or less than most of the rest of the data.
Consider the county-level
6. 2008 Utah election results
votes for Utah in the 2008 election. There are five counties that have a high vote share for Obama, one of which has almost 60%. Even though the majority of the counties in Utah had less than 25% voting for Obama,
7. 2008 Utah election results
these anomalous counties pull the mean higher up. So, when we compute the mean, we get about 28%. We might like a summary statistic that is immune to extreme data.
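A small sketch of this effect, using invented numbers rather than the real Utah results: most of the hypothetical counties sit below 25%, and a few sit far above. Adding the high counties moves the mean well above where most of the data lie.

```python
import numpy as np

# Hypothetical vote percentages (invented for illustration, not the Utah data)
typical = np.array([18.0, 20.0, 21.0, 22.0, 23.0, 24.0])  # most counties
high = np.array([45.0, 50.0, 58.0])                       # a few anomalous counties

all_counties = np.concatenate([typical, high])

mean_typical = np.mean(typical)    # mean of the typical counties only
mean_all = np.mean(all_counties)   # mean once the high counties are included

# Count how many counties fall below the overall mean
n_below = np.sum(all_counties < mean_all)

print(mean_typical, mean_all, n_below)
```

In this made-up example, six of the nine counties fall below the overall mean, so the mean no longer describes a typical county.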
8. The median
The median provides exactly that. The median is the middle value of a data set. It is defined by how it is calculated: sort the data and choose the datum in the middle. When the number of data points is even, there is no single middle datum, and the usual convention is to take the average of the two middle values. Because it is derived from the ranking of the sorted data, and not from the values of the data, the median is immune to data that take on extreme values.
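The sort-and-pick-the-middle recipe can be sketched in a few lines of plain Python. The function name median_by_sorting is a hypothetical helper written for this illustration; for an even number of points it averages the two middle values, which is a common convention.

```python
def median_by_sorting(values):
    """Return the median by sorting and taking the middle value."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: the single middle datum
        return ordered[mid]
    # Even number of points: average of the two middle data
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_sorting([3, 1, 2]))
print(median_by_sorting([4, 1, 3, 2]))
```

Only the position of each value in the sorted order matters here, which is why an extreme value at either end does not change the result.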
9. 2008 Utah election results
Here it is displayed on the bee swarm plot. It is not tugged up by the counties with a large fraction of votes for Obama.
10. Computing the median
The median is computed by simply calling the np.median function.
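As a quick sketch of that call, again with made-up percentages rather than the real election data, we can compute the median, then push the largest value even further out and see that the median does not move while the mean does.

```python
import numpy as np

# Hypothetical vote percentages (invented for illustration)
votes = np.array([18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 45.0, 50.0, 58.0])

median_before = np.median(votes)
mean_before = np.mean(votes)

# Make the largest value even more extreme
votes_extreme = votes.copy()
votes_extreme[-1] = 95.0

median_after = np.median(votes_extreme)
mean_after = np.mean(votes_extreme)

print(median_before, median_after)  # the median is unchanged
print(mean_before, mean_after)      # the mean is pulled upward
```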
11. Let's practice!
Now let's practice using these two powerful and ubiquitous summary statistics!