Cluster sampling

1. Cluster sampling

One problem with stratified sampling is that you need to collect data from every subgroup. When collecting data is expensive, for example when you have to physically travel to a location to collect it, this requirement can make your analysis prohibitively expensive. There's a cheaper alternative called cluster sampling.

2. Stratified sampling vs. cluster sampling

The stratified sampling approach was to split the population into subgroups, then use simple random sampling on each of them. Cluster sampling limits the number of subgroups in the analysis by first picking a few of them with simple random sampling. You then perform simple random sampling within each selected subgroup, as before.

3. Varieties of coffee

Let's return to the coffee dataset and look at the varieties of coffee. In this image, each bean represents a whole subgroup rather than an individual coffee. There are twenty-eight of them. I've used base R's unique function rather than tidyverse code because a vector is a little easier to work with than a data frame here. Let's suppose that it's expensive to work with all the different varieties.
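
As a rough illustration of getting the distinct subgroups as a plain list, here is a minimal sketch in Python using only the standard library (not the R code from the course). The `variety_column` values are a hypothetical stand-in, not the real coffee data.

```python
# Hypothetical stand-in for the variety column of the coffee dataset.
variety_column = ["Bourbon", "Caturra", "Bourbon", "Typica", "Caturra", "Sumatra"]

# dict.fromkeys keeps the first occurrence of each value, in order,
# which gives a de-duplicated list, similar in spirit to R's unique().
varieties = list(dict.fromkeys(variety_column))

print(varieties)
print(len(varieties))
```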

4. Stage 1: sampling for subgroups

The first stage of cluster sampling is to cut down the number of varieties by randomly selecting a few of them. Here, I've used the sample function to get three varieties.
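
Stage 1 might look something like this in Python, using the standard library's `random.sample` in place of R's sample function. The variety names and the seed are assumptions chosen only for illustration.

```python
import random

# Hypothetical list of distinct varieties (the real dataset has twenty-eight).
varieties = ["Bourbon", "Caturra", "Typica", "Sumatra", "Blue Mountain", "Gesha"]

# A seeded generator makes the draw reproducible; the seed value is arbitrary.
rng = random.Random(2024)

# Stage 1: pick three varieties without replacement.
selected_varieties = rng.sample(varieties, k=3)

print(selected_varieties)
```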

5. Stage 2: sampling each group

The second stage of cluster sampling is to perform simple random sampling on each of the three varieties we randomly selected. The code is the same as for stratified sampling, but with a filter step beforehand. We filter the dataset for rows where the variety is one of the three selected values. Here, I've opted for equal counts sampling, with five rows from each variety.
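
The filter-then-sample step can be sketched in Python as follows. This is only an illustration with the standard library, not the course's R code: the rows in `coffees` are generated placeholders, `selected_varieties` pretends to be a stage 1 result, and every variety is assumed to have at least five rows.

```python
import random
from collections import defaultdict

# Hypothetical dataset: seven placeholder rows for each of five varieties.
all_varieties = ["Bourbon", "Caturra", "Typica", "Gesha", "Pacamara"]
coffees = [
    {"variety": variety, "id": f"{variety}-{i}"}
    for variety in all_varieties
    for i in range(7)
]

# Pretend these three varieties came out of stage 1.
selected_varieties = ["Bourbon", "Typica", "Gesha"]
rng = random.Random(0)  # arbitrary seed for reproducibility

# Filter: keep only rows whose variety was selected in stage 1.
filtered = [row for row in coffees if row["variety"] in selected_varieties]

# Group the remaining rows by variety.
groups = defaultdict(list)
for row in filtered:
    groups[row["variety"]].append(row)

# Stage 2: equal counts sampling, five rows from each selected variety.
cluster_sample = []
for variety, rows in groups.items():
    cluster_sample.extend(rng.sample(rows, k=5))

print(len(cluster_sample))
```

Note that `random.sample` raises a `ValueError` if a group has fewer rows than requested, so real data with small groups would need the count capped at the group size.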

6. Stage 2 output

Here's the result. Notice that there are only ten rows rather than the fifteen you might expect from sampling five rows from each of three varieties. The reason we have fewer rows is that there are only two Blue Mountain coffees and three Sumatra coffees in the dataset. This won't be a problem in every dataset you try cluster sampling on, but it's something to be aware of. If some subgroups contain very few rows, you might not be able to get a meaningfully large sample size for those subgroups.
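
The row count can be checked with a little arithmetic, sketched here in Python. The sizes for Blue Mountain and Sumatra come from the text above; the third variety's name and size are assumptions, and the only thing that matters is that it has at least five rows.

```python
# Group sizes: two values from the transcript, one assumed placeholder.
group_sizes = {
    "Blue Mountain": 2,
    "Sumatra": 3,
    "Third variety": 8,  # hypothetical name and size (any size >= 5 works)
}
rows_per_group = 5

# Each group can contribute at most as many rows as it actually has.
rows_drawn = {
    variety: min(rows_per_group, size)
    for variety, size in group_sizes.items()
}
total_rows = sum(rows_drawn.values())

print(rows_drawn)
print(total_rows)  # 2 + 3 + 5 = 10
```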

7. Multistage sampling

Notice that we had two stages in the cluster sampling. We randomly sampled the subgroups to include, then we randomly sampled rows from those subgroups. Cluster sampling is a special case of multistage sampling. It's possible to use more than two stages. A common example is national surveys, which can include several levels of administrative regions, like states, counties, cities, and neighborhoods.
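
To make the idea of more than two stages concrete, here is a minimal Python sketch with three stages (states, then counties, then cities). All region names, the nesting, and the per-stage counts are hypothetical; a real survey design would choose them differently.

```python
import random

# Hypothetical administrative hierarchy: state -> county -> list of cities.
regions = {
    "State A": {
        "County A1": ["City 1", "City 2", "City 3"],
        "County A2": ["City 4", "City 5"],
    },
    "State B": {
        "County B1": ["City 6", "City 7", "City 8"],
        "County B2": ["City 9", "City 10"],
    },
    "State C": {
        "County C1": ["City 11", "City 12"],
        "County C2": ["City 13", "City 14", "City 15"],
    },
}

rng = random.Random(1)  # arbitrary seed for reproducibility

# Stage 1: randomly select two states.
selected_states = rng.sample(sorted(regions), k=2)

multistage_sample = []
for state in selected_states:
    # Stage 2: randomly select one county within each selected state.
    county = rng.choice(sorted(regions[state]))
    # Stage 3: randomly select two cities within that county.
    for city in rng.sample(regions[state][county], k=2):
        multistage_sample.append((state, county, city))

print(multistage_sample)
```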

8. Let's practice!

Time to sample some clusters.