Simple random and systematic sampling
1. Simple random and systematic sampling
There are several methods of sampling from a population. Here we'll look at simple random sampling and systematic sampling.
2. Simple random sampling
Simple random sampling works like a raffle or lottery. You start with your population of raffle tickets or lottery balls, and randomly pick them out one at a time until you have enough winners.
3. Simple random sampling of coffees
In our coffee ratings dataset, instead of raffle tickets or lottery balls, the population consists of coffee varieties. To perform simple random sampling, we just take some at random, one at a time. Each coffee has the same chance as any other of being picked. When using this technique, sometimes we might end up with two coffees that were next to each other in the dataset, and sometimes large stretches of the dataset might not be selected from at all.
4. Simple random sampling in R
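As a minimal sketch of what this looks like in code, the block below builds a small made-up stand-in for the coffee ratings data and draws five rows from it with slice_sample from dplyr. The column names and values are assumptions for illustration, not the real dataset; only the row count matches the one quoted later.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the coffee ratings data (values are made up)
set.seed(2024)
coffee_ratings <- tibble(
  variety = paste0("coffee_", 1:1338),
  aftertaste = round(runif(1338, min = 6, max = 9), 2)
)

# Simple random sample: every row has the same chance of being picked
coffee_sample <- coffee_ratings %>%
  slice_sample(n = 5)

coffee_sample
```

Because the rows are picked at random, rerunning without the seed gives a different set of five coffees each time.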
You've already seen how to do simple random sampling in R. Call slice_sample, setting n to the size of the sample.
5. Systematic sampling
Another sampling method is known as systematic sampling. This samples the population at regular intervals. Here, looking from top to bottom and left to right within each row, every fifth coffee is sampled.
6. Adding a row ID column
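A small sketch of this step, using a made-up stand-in for the coffee data (the variety column is an assumption). rowid_to_column adds a column that is named rowid by default and counts up from one.

```r
library(tibble)

# Hypothetical stand-in for the coffee ratings data
coffee_ratings <- tibble(variety = paste0("coffee_", 1:1338))

# Add a leading column of row numbers, named "rowid" by default
coffee_ratings <- rowid_to_column(coffee_ratings)

head(coffee_ratings)
```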
Demonstrating systematic sampling is easier if the dataset includes a row number column. We can add one using rowid_to_column from the tibble package.
7. Systematic sampling in R
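The interval calculation for this step can be sketched with plain numbers. The variable names here are illustrative; the population size is the one thousand, three hundred and thirty eight rows quoted in this section.

```r
sample_size <- 5
pop_size <- 1338  # number of rows in the dataset, as quoted in the text

# Ordinary division keeps the fractional part
pop_size / sample_size    # 267.6

# Integer division discards it
interval <- pop_size %/% sample_size
interval                  # 267
```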
Systematic sampling in R is slightly fiddlier than simple random sampling. The tricky part is determining how big the interval should be between each row to include in the sample. Suppose you want a sample size of five coffees. The population size is the number of rows in the dataset, one thousand, three hundred and thirty eight. The interval is the population size divided by the sample size, except you want the answer to be an integer. So we use integer division, with a forward slash wrapped in percent signs. This is like standard division, but discards any fractional part. One-three-three-eight divided by five is two hundred and sixty seven point six, and discarding the fractional part leaves two hundred and sixty seven.
8. Systematic sampling in R 2
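One possible way to do this step, sketched on a made-up stand-in for the coffee data (the variety column is an assumption). The row indexes are the sequence from one to the sample size, multiplied by the interval.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the coffee ratings data, with a rowid column
coffee_ratings <- tibble(variety = paste0("coffee_", 1:1338)) %>%
  rowid_to_column()

sample_size <- 5
interval <- nrow(coffee_ratings) %/% sample_size  # 267

# Every 267th row: 267, 534, 801, 1068, 1335
row_indexes <- seq_len(sample_size) * interval

coffee_sys_samp <- coffee_ratings %>%
  slice(row_indexes)

coffee_sys_samp
```

An equivalent alternative would be filtering on rowid with the modulo operator; which reads better is largely a matter of taste.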
There are many ways of doing this next step of filtering for every two hundred and sixty seventh row. One way is to create a vector of row indexes by taking the sequence from one to the sample size and multiplying it by the interval. Then you call slice, passing those row indexes. Play around and see how many ways you can think of to solve this.
9. The trouble with systematic sampling
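A rough sketch of the kind of check this section relies on. The data is simulated with a deliberate downward trend so that the pattern is visible; the numbers are invented for illustration and are not the real coffee scores.

```r
library(ggplot2)
library(tibble)

# Simulated data: aftertaste drifts downward with row position (made-up values)
set.seed(42)
n_rows <- 1338
coffee_ratings <- tibble(
  rowid = 1:n_rows,
  aftertaste = 8 - 1.5 * (rowid / n_rows) + rnorm(n_rows, sd = 0.3)
)

# Scatter plot of aftertaste against row ID with a smooth trend line
trend_plot <- ggplot(coffee_ratings, aes(x = rowid, y = aftertaste)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x)

trend_plot
```

If the trend line is roughly flat and the points look like noise, row order carries no information about the attribute.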
There is a problem with systematic sampling. Suppose we are interested in statistics about the aftertaste attribute of the coffees. Plotting aftertaste against rowid with a smooth trend line shows a pattern. Earlier rows have higher aftertaste scores than later rows. Because systematic sampling picks rows by their position, this ordering can introduce bias into the statistics that we calculate. In general, it is only safe to use systematic sampling if a plot like this has no pattern. That is, it just looks like noise.
10. Making systematic sampling safe
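A minimal sketch of the shuffle-and-renumber step, again on a made-up stand-in for the coffee data (the variety column is an assumption). The old rowid column is dropped before a new one is added, since rowid_to_column refuses to overwrite an existing column.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the coffee ratings data, with a rowid column
set.seed(1)
coffee_ratings <- tibble(variety = paste0("coffee_", 1:1338)) %>%
  rowid_to_column()

shuffled <- coffee_ratings %>%
  slice_sample(prop = 1) %>%  # sample 100% of rows, i.e. shuffle them
  select(-rowid) %>%          # drop the old, now out-of-order row IDs
  rowid_to_column()           # renumber from one in the new order

head(shuffled)
```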
To ensure that systematic sampling is safe, you can randomize the row order before sampling. slice_sample has an argument named prop that lets you specify the proportion of the dataset to return in the sample, rather than the absolute number of rows that n would specify. Setting prop to one randomly samples the whole dataset. In effect, this shuffles the rows of the dataset. Next, the row IDs need to be reset, so that they go in order from one again. Redrawing the plot with the shuffled dataset shows no pattern between aftertaste and row ID. This is great, but note that once we've shuffled the rows, systematic sampling is essentially the same as simple random sampling.
11. Let's practice!
Let's get sampling!