Measures of spread

1. Measures of spread

In this lesson, we'll talk about another set of summary statistics: measures of spread.

2. What is spread?

Spread is just what it sounds like - it describes how spread apart or close together the data points are. Just like measures of center, there are a few different measures of spread.

3. Variance

The first measure, variance, measures the average squared distance from each data point to the data's mean.

4. Calculating the variance

To calculate the variance, we start by calculating the distance between each point and the mean, so we get one number for every data point.

5. Calculating the variance

We then square each distance and add the squared distances together.

6. Calculating the variance

Finally, we divide the sum of squared distances by the number of data points minus 1, giving us the variance. The higher the variance, the more spread out the data is. It's important to note that the units of variance are squared, so in this case, it's 19.8 hours squared. We can calculate the variance in one step using the var function.
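The three steps above can be sketched in R as follows. This is only an illustration: the vector hours is made-up data, not the dataset shown on the slides, and the variable names are assumptions.

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: distance between each data point and the mean
dists <- hours - mean(hours)

# Step 2: square each distance and add them all together
sum_sq <- sum(dists^2)

# Step 3: divide by the number of data points minus 1
variance_manual <- sum_sq / (length(hours) - 1)

# var() performs the same calculation in one step
variance_builtin <- var(hours)

print(variance_manual)
print(variance_builtin)
```

Both printed values should agree, since var divides by the number of data points minus 1 as well.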

7. Standard deviation

The standard deviation is another measure of spread, calculated by taking the square root of the variance. It can also be calculated using the sd function. The nice thing about standard deviation is that its units are the same as the data's units rather than squared, so it's usually easier to interpret. It's easier to wrap your head around 4 and a half hours than 19.8 hours squared.
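As a quick check of that relationship, here is a small sketch using a made-up vector called hours (an assumption for illustration, not the slide data):

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standard deviation as the square root of the variance
sd_from_var <- sqrt(var(hours))

# sd() gives the same result directly
sd_builtin <- sd(hours)

print(sd_from_var)
print(sd_builtin)
```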

8. Mean absolute deviation

Mean absolute deviation takes the absolute value of the distances to the mean, and then takes the mean of those absolute distances. While this is similar to standard deviation, it's not exactly the same. Standard deviation squares distances, so longer distances are penalized more than shorter ones, while mean absolute deviation penalizes each distance equally. One isn't better than the other, but standard deviation (SD) is more common than mean absolute deviation (MAD).
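A minimal sketch of this calculation, again on a made-up vector called hours (an assumption, not the slide data). Note that base R's mad function computes the median absolute deviation, which is a different statistic, so the sketch builds the mean absolute deviation from mean and abs instead.

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Distance between each data point and the mean
dists <- hours - mean(hours)

# Mean absolute deviation: the mean of the absolute distances
mean_abs_dev <- mean(abs(dists))

# Standard deviation, for comparison
std_dev <- sd(hours)

print(mean_abs_dev)
print(std_dev)
```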

9. Quartiles

Before we discuss the next measure of spread, let's quickly talk about quartiles. Quartiles split up the data into four equal parts. Here, we call the quantile function to get the quartiles of the data. This means that 25% of the data is between 1.9 and 7.85, another 25% is between 7.85 and 10.10, and so on. It also means that the second quartile splits the data in two, with 50% of the data below it and 50% of the data above it, so it's exactly the same as the median.
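To see this on a small example, the sketch below calls quantile on a made-up vector called hours (an assumption for illustration; the values differ from the slide data) and compares the second quartile with the median.

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# By default, quantile() returns the minimum, the three quartiles, and the maximum
quartiles <- quantile(hours)
print(quartiles)

# The second quartile (the 50% entry) is the median
second_quartile <- unname(quartiles["50%"])
print(second_quartile == median(hours))
```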

10. Boxplots use quartiles

The boxes in boxplots represent quartiles. The bottom of the box is the first quartile, and the top of the box is the third quartile. The middle line is the second quartile, or the median.

11. Quantiles

Quantiles, also called percentiles, are a generalized version of quartiles, so they can split data into five pieces or ten pieces, for example. By default, the quantile function returns the quartiles of the data, but we can adjust this using the probs argument, which takes in a vector of proportions. Here, we split the data into five equal pieces. We can also use the seq function as a shortcut, which takes in the lowest number, the highest number, and the number we want to jump by. We can compute the same quantiles using seq from 0 to 1, jumping by 0.2.
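Both ways of passing probs can be sketched like this, using a made-up vector called hours (an assumption for illustration, not the slide data):

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Split the data into five equal pieces by listing the proportions explicitly
quintiles_explicit <- quantile(hours, probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))

# Same proportions generated with seq(from, to, by)
quintiles_seq <- quantile(hours, probs = seq(0, 1, 0.2))

print(quintiles_explicit)
print(quintiles_seq)
```

Five pieces need six cut points, which is why each result has six values.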

12. Interquartile range (IQR)

The interquartile range, or IQR, is another measure of spread. It's the distance between the 25th and 75th percentiles, which is also the height of the box in a boxplot. We can calculate it by using the quantile function to get the 25th and 75th percentiles and subtracting one from the other, which gives 5.9 hours.
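A minimal sketch of the IQR calculation, assuming a made-up vector called hours (not the slide data). Base R also provides an IQR function, shown here only as a cross-check.

```r
# Hypothetical data: hours of sleep for eight animals (made up for illustration)
hours <- c(2, 4, 4, 4, 5, 5, 7, 9)

# 25th and 75th percentiles
q1 <- unname(quantile(hours, 0.25))
q3 <- unname(quantile(hours, 0.75))

# IQR is the distance between them
iqr_manual <- q3 - q1

# Base R's IQR() should give the same value with default settings
iqr_builtin <- IQR(hours)

print(iqr_manual)
print(iqr_builtin)
```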

13. Outliers

Outliers are data points that are substantially different from the others. But how do we know what a substantial difference is? A rule that's often used is that any data point less than the first quartile minus 1.5 times the IQR is an outlier, as well as any point greater than the third quartile plus 1.5 times the IQR.

14. Finding outliers

To find outliers, we'll start by calculating the IQR of the mammals' body weights. We can then calculate the lower and upper thresholds following the formulas from the previous slide. We can now filter the data frame to find mammals whose body weight is below the lower threshold or above the upper threshold. We can see that there are eleven body weight outliers in this dataset, including the cow and the Asian elephant.
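The same steps can be sketched on a tiny made-up data frame. Everything here is an assumption for illustration: the data frame animals, its columns name and bodywt, and the values are hypothetical, and the sketch uses base R subsetting rather than whichever filtering function the slides use.

```r
# Hypothetical data frame of body weights (made up for illustration)
animals <- data.frame(
  name = paste0("mammal_", 1:8),
  bodywt = c(1, 2, 3, 4, 5, 6, 7, 100)
)

# Quartiles and IQR of the body weights
q1 <- unname(quantile(animals$bodywt, 0.25))
q3 <- unname(quantile(animals$bodywt, 0.75))
iqr <- q3 - q1

# Thresholds: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Keep the rows whose body weight falls outside the thresholds
outliers <- animals[animals$bodywt < lower | animals$bodywt > upper, ]
print(outliers)
```

In this made-up example only the heaviest animal falls outside the thresholds.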

15. Let's practice!

Time to practice measuring spread and finding outliers.