1. Stratified and weighted random sampling
Stratified sampling is a technique for sampling a population that contains subgroups: we split the population into those subgroups and sample from each one separately.
2. Coffees by country
For example, we could group the coffee ratings by country.
If we count the number of coffees by country using the value_counts method, we can see that these six countries have the most data.
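As a minimal sketch of that count, assuming the data lives in a DataFrame called coffee_ratings with a country_of_origin column (both names, the country list, and every count below are illustrative stand-ins, not the real dataset):

```python
import pandas as pd

# Synthetic stand-in for the coffee ratings data; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175, "Brazil": 130,
          "Taiwan": 75, "United States (Hawaii)": 70, "Honduras": 50, "Kenya": 25}
coffee_ratings = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# Count the rows for each country, sorted from most to least frequent.
country_counts = coffee_ratings["country_of_origin"].value_counts()

# The first six entries are the countries with the most data.
top_six = country_counts.head(6)
print(top_six)
```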
3. Filtering for 6 countries
To make it easier to think about sampling subgroups, let's limit our analysis to these six countries.
We can use the dot-isin method to filter the population and only return the rows corresponding to these six countries.
This filtered dataset is stored as coffee_ratings_top.
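The filtering step might look roughly like this. The country_of_origin column name, the list of six countries, and the synthetic counts are assumptions made for illustration only.

```python
import pandas as pd

# Synthetic stand-in for the coffee ratings data; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175, "Brazil": 130,
          "Taiwan": 75, "United States (Hawaii)": 70, "Honduras": 50, "Kenya": 25}
coffee_ratings = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# The six countries we want to keep (an illustrative list).
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
                         "Brazil", "Taiwan", "United States (Hawaii)"]

# .isin() returns a Boolean mask: True where the country is in the list.
top_counted_subset = coffee_ratings["country_of_origin"].isin(top_counted_countries)

# Keep only the rows where the mask is True.
coffee_ratings_top = coffee_ratings[top_counted_subset]
print(coffee_ratings_top.shape)
```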
4. Counts of a simple random sample
Let's take a ten percent simple random sample of the dataset using dot-sample with frac set to zero-point-one. We also set the random_state argument to ensure reproducibility.
As with the whole dataset, we can look at the counts for each country. To make comparisons easier, we set normalize to True to convert the counts into proportions, which show what fraction of coffees in the sample came from each country.
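Here is one way this could look in code. The data is a synthetic stand-in with made-up counts, and the random_state value of 2021 is an arbitrary choice; any fixed integer gives a reproducible sample.

```python
import pandas as pd

# Synthetic stand-in for the six-country dataset; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175,
          "Brazil": 130, "Taiwan": 75, "United States (Hawaii)": 70}
coffee_ratings_top = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# Ten percent simple random sample; random_state makes it reproducible.
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

# normalize=True turns the counts into proportions that sum to one.
samp_props = coffee_ratings_samp["country_of_origin"].value_counts(normalize=True)
print(samp_props)
```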
5. Comparing proportions
Here are the proportions for the population and the ten percent sample side by side.
Just by chance, in this sample, Taiwanese coffees form a disproportionately low percentage.
The different makeup of the sample compared to the population could be a problem if we want to analyze the country of origin, for example.
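To put the two sets of proportions side by side ourselves, one option is to join them into a single table. This is only a sketch on synthetic data: the counts, the random_state value, and the column labels population and sample are all assumptions for illustration.

```python
import pandas as pd

# Synthetic stand-in for the six-country dataset; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175,
          "Brazil": 130, "Taiwan": 75, "United States (Hawaii)": 70}
coffee_ratings_top = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

# Proportion of rows from each country, in the population and in the sample.
pop_props = coffee_ratings_top["country_of_origin"].value_counts(normalize=True)
samp_props = coffee_ratings_samp["country_of_origin"].value_counts(normalize=True)

# Align the two Series on country; a country absent from the sample gets 0.
comparison = pd.concat([pop_props, samp_props], axis=1,
                       keys=["population", "sample"]).fillna(0)
print(comparison)
```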
6. Proportional stratified sampling
If we care about the proportions of each country in the sample closely matching those in the population, then we can group the data by country before taking the simple random sample. Note that we used the Python line continuation backslash here, which can be useful for breaking up longer chains of pandas code like this.
Calling the dot-sample method after grouping takes a simple random sample within each country.
Now the proportions of each country in the stratified sample are much closer to those in the population.
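A sketch of proportional stratified sampling, again on synthetic data with made-up counts and an arbitrary random_state. It uses the sample method available on grouped DataFrames (pandas 1.1 and later) and the backslash continuation mentioned above.

```python
import pandas as pd

# Synthetic stand-in for the six-country dataset; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175,
          "Brazil": 130, "Taiwan": 75, "United States (Hawaii)": 70}
coffee_ratings_top = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# Group by country, then draw ten percent of the rows within each group.
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
    .sample(frac=0.1, random_state=2021)

# Compare the makeup of the stratified sample with the population.
pop_props = coffee_ratings_top["country_of_origin"].value_counts(normalize=True)
strat_props = coffee_ratings_strat["country_of_origin"].value_counts(normalize=True)
print(strat_props)
```

Because each group contributes about ten percent of its own rows, the only gap between the two sets of proportions comes from rounding the per-group sample sizes to whole rows.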
7. Equal counts stratified sampling
One variation of stratified sampling is to sample equal counts from each group, rather than an equal proportion. The code only has one change from before. This time, we use the n argument in dot-sample instead of frac to extract fifteen randomly-selected rows from each country.
Here, the resulting sample has equal proportions of one-sixth from each country.
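The equal-counts version could be sketched like this, swapping frac for n. The data and the random_state value are illustrative assumptions, and every group needs at least fifteen rows for this to work without replacement.

```python
import pandas as pd

# Synthetic stand-in for the six-country dataset; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175,
          "Brazil": 130, "Taiwan": 75, "United States (Hawaii)": 70}
coffee_ratings_top = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# Group by country, then draw exactly fifteen rows from each group.
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
    .sample(n=15, random_state=2021)

# Every country contributes the same number of rows, so each is one-sixth.
eq_counts = coffee_ratings_eq["country_of_origin"].value_counts()
eq_props = coffee_ratings_eq["country_of_origin"].value_counts(normalize=True)
print(eq_props)
```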
8. Weighted random sampling
A close relative of stratified sampling that provides even more flexibility is weighted random sampling.
In this variant, we create a column of weights that adjust the relative probability of sampling each row. For example, suppose we thought that it was important to have a higher proportion of Taiwanese coffees in the sample than in the population. We create a condition, in this case, rows where the country of origin is Taiwan.
Using the where function from NumPy, we can set a weight of two for rows that match the condition and a weight of one for rows that don't match the condition. This means that each time a row is randomly drawn, a Taiwanese coffee is twice as likely to be picked as any other coffee.
When we call dot-sample, we pass the column of weights to the weights argument.
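Putting those three steps together, a minimal sketch might look like this. The dataset is synthetic, and the column names country_of_origin and weight and the random_state value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the six-country dataset; all counts are made up.
counts = {"Mexico": 245, "Colombia": 185, "Guatemala": 175,
          "Brazil": 130, "Taiwan": 75, "United States (Hawaii)": 70}
coffee_ratings_top = pd.DataFrame({
    "country_of_origin": [c for c, n in counts.items() for _ in range(n)]
})

# Condition: True for rows where the country of origin is Taiwan.
condition = coffee_ratings_top["country_of_origin"] == "Taiwan"

# Weight of 2 where the condition holds, weight of 1 everywhere else.
coffee_ratings_top["weight"] = np.where(condition, 2, 1)

# Pass the weight column by name; pandas rescales the weights to sum to one.
coffee_ratings_weight = coffee_ratings_top.sample(
    frac=0.1, weights="weight", random_state=2021
)

weighted_props = coffee_ratings_weight["country_of_origin"].value_counts(normalize=True)
print(weighted_props)
```

The weights only change the relative chance of each row being drawn, so the exact share of Taiwanese coffees still varies from one random sample to the next.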
9. Weighted random sampling results
Here, we can see that Taiwanese coffees now make up seventeen percent of the sample, compared to eight-point-five percent of the population.
This sort of weighted sampling is common in political polling, where we need to correct for under- or over-representation of demographic groups.
10. Let's practice!
Time to try these new techniques out!