1. Measures of variability
How do you summarize the variability that you see in a set of numbers?
2. Spread from the center
Let's consider the life expectancies in those first 11 counties in the US county-level dataset, which we saved to the object x. Most measures have us thinking about variability in terms of how spread out the data are from the center.
3. The sample variance
Let's choose to define the center by the mean, and then quantify the distance from the mean by taking the difference between each observation and that mean. That results in 11 differences, some positive, some negative. We'd like to reduce all of these differences to a single measure of variability, so let's add them up. This is R's scientific notation, saying the sum is -1 (point) 42 times 10 to the -14. That number is essentially zero. Clearly something has gone wrong, because we can tell that there is variability in this dataset, but our measure hasn't detected it. The problem is that the positives and negatives have canceled each other out. This is easy to fix: you can square each difference to get rid of the negatives. This new measure is better, but it has an undesirable property: it will just keep getting bigger the more data you add. You can fix this by dividing by the number of observations, 11. OK, now this looks like a useful measure: you find the center of the data, then find the squared distance between the observations and that mean, averaged across the whole dataset. If you change the n to an n-1, you are left with what's called the sample variance, one of the most useful measures of the spread of a distribution. In R, this statistic is wrapped up into the function v-a-r, for variance.
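This progression can be sketched in R. The values in x below are illustrative stand-ins, not the actual 11 county life expectancies from the dataset:

```r
# Hypothetical life expectancies for 11 counties (illustrative values only)
x <- c(75.1, 77.4, 78.3, 73.9, 76.6, 79.2, 74.8, 77.0, 78.8, 75.5, 76.9)

n <- length(x)
deviations <- x - mean(x)     # 11 differences, some positive, some negative
sum(deviations)               # essentially zero: positives cancel negatives
sum(deviations^2) / n         # squaring fixes the cancellation; dividing by n fixes the growth
sum(deviations^2) / (n - 1)   # swap n for n - 1: the sample variance
var(x)                        # R's built-in var() matches the n - 1 version
```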
4. Standard deviation, IQR, and range
Another useful measure is the square root of this number, which is called the sample standard deviation or just sd in R. The convenient thing about the sample standard deviation is that, once computed, it is in the same units as the original data. In this case we can say that the standard deviation of these 11 counties' life expectancies is 1 (point) 69 years. By comparison, the variance of this sample is 2 (point) 87 years squared, which is a unit that we have no real intuition about.
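The relationship between the two measures is easy to verify in R (again with illustrative values for x):

```r
# Hypothetical life expectancies (illustrative values only)
x <- c(75.1, 77.4, 78.3, 73.9, 76.6, 79.2, 74.8, 77.0, 78.8, 75.5, 76.9)

sd(x)           # sample standard deviation, in the original units (years)
sqrt(var(x))    # identical: the sd is the square root of the variance
```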
There are two more measures of spread that are good to know about. The interquartile range, or IQR, is the distance between the two numbers that cut off the middle 50% of your data. This should sound familiar from the discussion of box plots: the height of the box is exactly the IQR. We can either get the first and third quartiles from the summary function and take their difference, or we can use the built-in IQR function.
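Both routes to the IQR can be sketched as follows. The x values are illustrative, and quantile() stands in here for reading the quartiles off the summary() output:

```r
# Hypothetical life expectancies (illustrative values only)
x <- c(75.1, 77.4, 78.3, 73.9, 76.6, 79.2, 74.8, 77.0, 78.8, 75.5, 76.9)

q <- quantile(x, c(0.25, 0.75))  # first and third quartiles
unname(q[2] - q[1])              # their difference is the IQR...
IQR(x)                           # ...which matches the built-in function
```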
The final measure is simply the range of the data: the distance between the maximum and the minimum. max and min are indeed functions in R, but you can also use the nested call diff(range(x)).
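Both spellings of the range, sketched with illustrative data:

```r
# Hypothetical life expectancies (illustrative values only)
x <- c(75.1, 77.4, 78.3, 73.9, 76.6, 79.2, 74.8, 77.0, 78.8, 75.5, 76.9)

max(x) - min(x)   # range as the difference between the extremes
diff(range(x))    # range() returns c(min, max); diff() subtracts them
```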
For any dataset, you can compute all four of these statistics, but which ones are the most meaningful? The most commonly used in practice is the standard deviation, so that's often a good place to start. But what happens if the dataset has some extreme observations?
5. The effect of extreme values
Let's say that Baldwin County, Alabama, the county with a life expectancy around 78, instead had a life expectancy of 97. If you recompute the variance and the standard deviation, you see that they've both gone through the roof. These measures are sensitive to extreme values in the same way that the mean is as a measure of center. If you recompute the range, it will certainly increase because it is completely determined by the extreme values. For this reason, the range is not often used.
If you recompute the IQR, however, you see that it hasn't budged. Because that observation is still the highest, the quartiles didn't move. This reveals a good reason for using the IQR: it holds up in situations where your dataset is heavily skewed or has extreme observations.
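This contrast can be checked numerically. The x values are illustrative, and the replacement value 97 follows the Baldwin County example above:

```r
# Hypothetical life expectancies (illustrative values only)
x <- c(75.1, 77.4, 78.3, 73.9, 76.6, 79.2, 74.8, 77.0, 78.8, 75.5, 76.9)

x_out <- x
x_out[which.max(x)] <- 97            # push the largest value to an extreme 97

var(x_out) > var(x)                  # TRUE: variance goes through the roof
sd(x_out) > sd(x)                    # TRUE: so does the standard deviation
diff(range(x_out)) > diff(range(x))  # TRUE: range is set entirely by the extremes
IQR(x_out) == IQR(x)                 # TRUE: the quartiles, and so the IQR, don't move
```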
6. Let's practice!
You'll put your understanding of variability to use in the next exercises.