
Measures of spread

1. Measures of spread

Welcome back!

2. Statistics for describing a variable

So far, we have seen a few statistics to describe a variable: count, median, average, minimum and maximum values, quartiles and interquartile range, modality, skewness and kurtosis. In this chapter, we will see another set of summary statistics, which describe the spread of your data when it is normally distributed.

3. Measures of spread

Spread is what it sounds like: it measures how spread out or close together your observations are. Both histograms have the same mean and total number of observations. The only difference is the spread around the mean. Note that spread around the mean is heavily affected by kurtosis and skewness: the more extreme values and asymmetry you have in your dataset, the larger the spread will be. Therefore, measuring spread around the mean is only useful when dealing with unimodal, symmetrical, bell-shaped distributions, such as these histograms. Loosely speaking, we call such distributions normally distributed.

4. Variance

One way to calculate spread around the mean is by calculating variance: you take the difference between each data point and the mean, square each difference, and then sum these squared differences. Lastly, you divide the sum by the number of observations minus one. In other words, variance is approximately the average of the squared differences from the mean. Don't worry about the formulas; they merely illustrate what Tableau does behind the scenes. The higher the variance, the more spread out your data is. Note that the unit of variance is the square of the variable's unit, which makes it harder to interpret.
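The steps just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Tableau implements it; the function name `sample_variance` and the `heights_cm` values are made up for the example.

```python
import statistics

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)

# Hypothetical measurements, invented for this example.
heights_cm = [160, 165, 170, 175, 180]

by_hand = sample_variance(heights_cm)
from_library = statistics.variance(heights_cm)  # also divides by n - 1

print(by_hand)       # unit is cm squared, not cm
print(from_library)
```

Both values should agree, since Python's `statistics.variance` uses the same n minus one denominator.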

5. Standard deviation (SD or $s$)

By taking the square root of the variance, you get the standard deviation, s. Consequently, the unit of the standard deviation is the same as the variable unit. Therefore, the standard deviation tells you roughly how far, on average, the data points lie from the mean. When you have a normal distribution, a rule of thumb is that about 68 percent of the data lies within one standard deviation of the mean. This share increases with more standard deviations: about 95 percent for two standard deviations, and about 99.7 percent for three standard deviations. Depending on the context, the number of standard deviations can be used as a threshold to pinpoint unusual values.
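To see the rule of thumb in action, here is a minimal sketch that simulates normally distributed data and counts how much of it falls within one, two and three standard deviations of the mean. The mean of 100, the standard deviation of 15, the sample size and the helper `share_within` are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is repeatable

# Simulated data from a normal distribution (illustrative parameters).
values = [random.gauss(100, 15) for _ in range(20_000)]

mean = statistics.fmean(values)
sd = statistics.stdev(values)  # sample standard deviation

def share_within(k):
    """Fraction of values lying within k standard deviations of the mean."""
    lower, upper = mean - k * sd, mean + k * sd
    return sum(lower <= x <= upper for x in values) / len(values)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

The printed shares should land close to 68, 95 and 99.7 percent, though not exactly on them, because the data is a random draw.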

6. Population vs. sample

When calculating variance or standard deviation, it is important to know whether you're working with the whole group you are interested in (the so-called population), or just a part of it (a sample). Take this example: you want to know how many freshwater species live in your town's lake (the population) to get an idea of its biodiversity value. It would be impossible and unethical to catch everything living in the lake and count the species.

7. Population vs. sample

Instead, you take random samples of different parts of the lake, count the number of species in each sample, and put them back. This process is called sampling.

8. Population vs. sample

Once you have enough samples, you can make statements about the whole population, in this case all species of the lake. Making these statements from a sample about the population is called statistical inference. It allows you to estimate the population mean, variance, or standard deviation from your sample mean, variance or standard deviation.
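The idea of estimating population values from a sample can be sketched with a small simulation. Everything here is hypothetical: the "population" is 10,000 simulated fish lengths rather than real lake data, and the sample size of 200 is arbitrary.

```python
import random
import statistics

random.seed(7)  # fixed seed so the simulation is repeatable

# Hypothetical population: lengths in cm of every fish in the lake.
population = [random.gauss(30, 5) for _ in range(10_000)]

# In practice you only get to see a random sample of it.
sample = random.sample(population, 200)

# Sample statistics serve as estimates of the population parameters.
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample formula, n - 1

# Known here only because the population was simulated.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)  # population formula, n

print(f"mean: sample {sample_mean:.2f} vs population {population_mean:.2f}")
print(f"SD:   sample {sample_sd:.2f} vs population {population_sd:.2f}")
```

The sample estimates will not match the population values exactly, but with a random sample of this size they should be close.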

9. Calculating spread in sample vs. population

So, what is the difference between calculating measures of spread for a sample and for the population? The formulas you've seen so far are the ones for the sample variance and the sample standard deviation. By default, Tableau treats your dataset as a sample, not the whole population. This is often the right call: for example, you might only have data on a thousand people per European country (your sample) and want to know the spread of education status of all people in Europe (the population). In this case, you calculate the variance and standard deviation using the sample formulas. On the other hand, if you have data on the education status of all thousand people in your university, and only want to calculate the spread of education status within your university, you don't need to generalize your results to other universities. You can treat your university as the whole population and use the population formulas.

10. Calculating spread in sample vs. population

Note that the only difference in the calculation is the denominator: you divide by the number of observations when you have the population, and by the number of observations minus one when you have a sample. This matters most when working with small sample sizes. With larger sample sizes, the effect of subtracting one becomes negligible, and your sample statistic will be close to the true population parameter.
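A short sketch makes the effect of the denominator concrete. It uses Python's `statistics.variance` (sample formula) and `statistics.pvariance` (population formula) as stand-ins for the two calculations; the `heights_cm` values and the helper `denominator_ratio` are invented for the example.

```python
import statistics

# Hypothetical measurements, invented for this example.
heights_cm = [160, 165, 170, 175, 180]

sample_var = statistics.variance(heights_cm)       # divides by n - 1
population_var = statistics.pvariance(heights_cm)  # divides by n

print(sample_var, population_var)

def denominator_ratio(n):
    """Population variance divided by sample variance for n observations."""
    return (n - 1) / n

# The two formulas converge as the number of observations grows.
for n in (5, 50, 5000):
    print(n, denominator_ratio(n))
```

With only five observations the two results differ by a fifth, while with thousands of observations the ratio is practically one.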

11. Let's practice!

Let's see if your knowledge on measures of spread varies from the mean.