1. Stratified and weighted random sampling
Stratified sampling is a technique that allows you to sample a population that contains subgroups.
2. Coffees by country
For example, we might want to group the coffee ratings by country.
If we count the number of coffees by country, we can see that these six countries have the most data.
3. Filtering for 6 countries
To make it easier to think about sampling subgroups, let's limit our analysis these six countries. I've used filter to make the subset, but I could also have used a semi join.
4. Counts of a simple random sample
Let's take a ten percent simple random sample of the dataset using slice_sample with prop set to zero-point-one.
As with the whole dataset, we take counts, and to make it easier to compare, we can add a percent column showing what percent of the coffees in the sample come from each country.
5. Comparing counts
Here are the counts for the population and that ten percent sample side by side.
Just by chance, in this sample Guatemalan coffees form a disproportionately high percentage of the dataset, and Colombian coffees form a disproportionately low percentage.
The different makeup of the sample compared to the population could be a problem if want to analyze the country of origin.
6. Proportional stratified sampling
If we care about the percentages of each country in the sample closely matching those in the population, then we can group the data by country before taking the simple random samples.
One quirk of dplyr is that we need to ungroup the dataset again after sampling.
Now the percentages of each country in the sampled dataset are as close as possible to those in the population.
7. Equal counts stratified sampling
One variation of stratified sampling is to provide equal counts in each group. The code only has one change from before. This time, we use the n argument to slice_sample instead of prop.
Here, the resulting sample has fifteen coffees from each country.
8. Weighted random sampling
A close relative of stratified sampling that provides even more flexibility is weighted random sampling.
In this variant, you create a column of weights that specify relative probabilities for sampling each row. For example, suppose you thought that it was important to have a higher proportion Taiwanese coffees in the sample than in the population. Then you could set a weight of two for each row where the country of origin was Taiwan, and one for other rows. That means that when each row is randomly picked, rows with Taiwanese coffees have twice the chance of being picked compared to other rows.
When you call slice_sample, you pass this column of weights to the weight_by argument.
Here, you can see that Taiwan now contains seventeen percent of the sampled dataset, compared to eight point five in the population.
This sort of weighted sampling is common in political polling, where you need to correct for under or over-representation of demographic groups.
9. Let's practice!
Let's try these new techniques.