1. Relative error of point estimates
Let's look at how the size of a sample affects the accuracy of the point estimates you calculate from it.
2. Sample size is number of rows
The sample size is the number of observations, that is, the number of rows, in the sample. That's true whichever method you used to create the sample.
We'll stick to looking at simple random sampling, since it works well in most cases, and it's easier to reason about.
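As a minimal sketch of that idea, the snippet below draws a simple random sample and counts its rows. The data frame coffee_population and its cup_points column are hypothetical stand-ins built from random numbers, not the course dataset, and slice_sample from dplyr is assumed as the sampling function.

```r
library(dplyr)

# Hypothetical stand-in for the coffee data: 5,000 made-up scores.
set.seed(2024)
coffee_population <- data.frame(cup_points = rnorm(5000, mean = 82, sd = 3))

# Simple random sampling: draw 10 rows at random, without replacement.
simple_sample <- coffee_population %>%
  slice_sample(n = 10)

# The sample size is the number of rows in the sample.
sample_size <- nrow(simple_sample)
sample_size
```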
3. Various sample sizes
Let's calculate a population parameter, the mean cup points for the coffees. It's 82.15. This is our gold standard to compare against. Since summarize returns a data frame, I used the pull function to extract the value as a number.
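The summarize-then-pull step might look something like the sketch below. The coffee_population data frame and the cup_points column are hypothetical and filled with random numbers, so the mean it prints will not match the figure quoted for the real data.

```r
library(dplyr)

# Hypothetical stand-in for the coffee data: 5,000 made-up scores.
set.seed(2024)
coffee_population <- data.frame(cup_points = rnorm(5000, mean = 82, sd = 3))

# summarize() returns a one-row data frame holding the mean ...
mean_df <- coffee_population %>%
  summarize(mean_points = mean(cup_points))

# ... so pull() extracts that single column as a plain number.
population_mean <- mean_df %>%
  pull(mean_points)

is.data.frame(mean_df)
is.numeric(population_mean)
```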
With a sample size of 10, the point estimate of this parameter is off by about 0.7.
Increasing the sample size to 100 gets us closer; the estimate is off by about 0.1.
Increasing the sample size further to 1,000 brings the estimate to about 0.01 away from the true value.
In general, larger sample sizes will give us more accurate results.
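To try this comparison yourself, a sketch along the following lines could work. It uses a hypothetical coffee_population data frame of random numbers and a helper named estimate_mean that is introduced here purely for illustration. Because the samples are random, the errors will differ from run to run and from the figures quoted for the real data.

```r
library(dplyr)

# Hypothetical stand-in for the coffee data: 5,000 made-up scores.
set.seed(2024)
coffee_population <- data.frame(cup_points = rnorm(5000, mean = 82, sd = 3))

population_mean <- coffee_population %>%
  summarize(mean_points = mean(cup_points)) %>%
  pull(mean_points)

# Illustrative helper: point estimate of the mean from a simple
# random sample of the requested size.
estimate_mean <- function(sample_size) {
  coffee_population %>%
    slice_sample(n = sample_size) %>%
    summarize(mean_points = mean(cup_points)) %>%
    pull(mean_points)
}

# Absolute difference between each estimate and the population mean.
sizes <- c(10, 100, 1000)
abs_errors <- sapply(sizes, function(n) abs(estimate_mean(n) - population_mean))
data.frame(sample_size = sizes, abs_error = abs_errors)
```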
4. Relative errors
For any of these sample sizes, we want to compare the population mean to the sample mean. This is the same code you just saw, but with the sample size replaced with a variable named sample_size.
The most common metric for assessing the difference between the population and sample means is the relative error. To calculate it, take the absolute difference between the two numbers, that is, ignore any minus sign, then divide by the population mean.
Here, we multiply by one hundred to make it a percentage.
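Here is a rough sketch of that calculation. The coffee_population data frame, the cup_points column, and the variable relative_error_pct are hypothetical names, and the data is random, so treat the output as illustrative only.

```r
library(dplyr)

# Hypothetical stand-in for the coffee data: 5,000 made-up scores.
set.seed(2024)
coffee_population <- data.frame(cup_points = rnorm(5000, mean = 82, sd = 3))

population_mean <- coffee_population %>%
  summarize(mean_points = mean(cup_points)) %>%
  pull(mean_points)

# The sample size lives in a variable so it is easy to change.
sample_size <- 100
sample_mean <- coffee_population %>%
  slice_sample(n = sample_size) %>%
  summarize(mean_points = mean(cup_points)) %>%
  pull(mean_points)

# Relative error as a percentage:
# 100 * |population mean - sample mean| / population mean
relative_error_pct <- 100 * abs(population_mean - sample_mean) / population_mean
relative_error_pct
```

As a quick check with the figures quoted earlier, an estimate that is off by about 0.7 from a population mean of 82.15 has a relative error of roughly 0.85 percent.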
5. Relative error vs. sample size
Here's a scatter plot of relative error versus sample size, with a smooth trend line calculated using the LOESS method. As you can see, the relative error decreases as the sample size increases. Beyond that, the plot has some important properties.
Firstly, the black line is really noisy, particularly with small sample sizes. If your sample size is small, the sample mean you calculate can change wildly when you add one or two more random rows to the sample.
Secondly, the downward slope in the trend line is quite steep to begin with. When you have a small sample size, adding just a few more rows can give you much better accuracy. Further to the right in the plot, the slope is less steep. If you already have a large sample size, adding a few more rows to the sample doesn't bring as much benefit.
Finally, at the far right of the plot, where the sample size is the whole population, the error decreases to zero.
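A plot like this could be approximated with the sketch below. It is not the code behind the slide: the coffee_population data is a small random stand-in, relative_error_pct is a helper named here for illustration, and the trend uses base R's loess function with default settings, which may differ from the settings used for the slide.

```r
library(dplyr)

# Hypothetical stand-in population, kept small so the loop runs quickly.
set.seed(2024)
coffee_population <- data.frame(cup_points = rnorm(500, mean = 82, sd = 3))
population_mean <- mean(coffee_population$cup_points)

# Illustrative helper: percentage relative error of the sample mean
# for one simple random sample of the requested size.
relative_error_pct <- function(sample_size) {
  sample_mean <- coffee_population %>%
    slice_sample(n = sample_size) %>%
    summarize(mean_points = mean(cup_points)) %>%
    pull(mean_points)
  100 * abs(population_mean - sample_mean) / population_mean
}

# One sample at every size, from 1 row up to the whole population.
errors <- data.frame(sample_size = seq_len(nrow(coffee_population)))
errors$rel_error_pct <- sapply(errors$sample_size, relative_error_pct)

# LOESS trend through the noisy errors.
trend <- loess(rel_error_pct ~ sample_size, data = errors)
errors$trend <- predict(trend)

plot(errors$sample_size, errors$rel_error_pct, type = "l",
     xlab = "Sample size", ylab = "Relative error (%)")
lines(errors$sample_size, errors$trend, col = "blue")

# Sampling every row without replacement reproduces the population
# mean, so the last error is zero up to floating-point rounding.
tail(errors$rel_error_pct, 1)
```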
6. Let's practice!
Let's explore sample sizes.