1. Sampling and point estimates
Hi, I'm Richie. Welcome to the course. You're going to learn all about sampling.
2. Estimating the population of France
To motivate why sampling might be useful, let's consider the problem of counting how many people live in France.
The standard approach is to take a census. This means contacting every household and asking how many people live there.
3. There are lots of people in France
Since there are millions of people in France, this is a really expensive process. Even with modern data collection technology, most countries will only conduct a census every five or ten years due to the cost.
4. Sampling households
In 1786, Pierre-Simon Laplace realized you could estimate the population with less effort. Rather than asking every household who lived there, he asked a small number of households and used statistics to estimate the number of people in the whole population.
This technique of working with a subset of the whole population is called sampling.
5. Population vs. sample
Two definitions are important for this course. The population is the complete set of data that you are interested in.
The previous example involved the literal population of France, but in statistics it doesn't have to refer to people.
One thing to bear in mind is that there is usually no equivalent of the census, so you typically won't know what the whole population is like. More on this in a moment.
The sample is the subset of data that you are working with.
6. Coffee rating dataset
Here's a dataset of professional ratings of coffees. Each row corresponds to one coffee, and there are eleven hundred and thirty-eight rows in the dataset. Each coffee is given an overall score from zero to one hundred, measured in "cup points". Other columns contain contextual information like the variety and country of origin, and scores between zero and ten for attributes of the coffee. The scores are averaged across all the reviewers of that particular coffee.
It doesn't contain every coffee in the world, so we don't know exactly what the population of coffees is. However, there are enough here that we can think of it as our population of interest.
7. Points vs. flavor: population
Let's consider the relationship between cup points and flavor by selecting those two columns.
Selecting columns doesn't drop any rows, so the resulting dataset still contains all eleven hundred and thirty-eight rows from the original dataset.
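The column selection could look something like this. It's a minimal sketch: the data frame name coffee_ratings and the column names total_cup_points and flavor are assumptions, and the values are simulated stand-ins rather than the real ratings.

```r
library(dplyr)

# Simulated stand-in for the coffee dataset (hypothetical names and values).
set.seed(2024)
n_coffees <- 1138
coffee_ratings <- tibble(
  total_cup_points = round(runif(n_coffees, min = 60, max = 95), 2),
  flavor = round(runif(n_coffees, min = 5, max = 10), 2),
  variety = sample(c("Bourbon", "Caturra", "Typica"), n_coffees, replace = TRUE)
)

# Keep only the two columns of interest; select() never removes rows.
pts_vs_flavor_pop <- coffee_ratings %>%
  select(total_cup_points, flavor)

# Number of rows and columns in the result
dim(pts_vs_flavor_pop)
```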
8. Points vs. flavor: 10 row sample
dplyr functions that return a subset of rows have names starting with "slice". slice_sample returns a random subset of rows.
Setting n to ten means ten random rows are returned.
By default, rows from the original dataset can't appear in the sample dataset multiple times, so we are guaranteed to have ten unique rows.
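Here's a sketch of taking that ten row sample. The data is simulated, and the names pts_vs_flavor_pop, coffee_id, total_cup_points, and flavor are assumptions made for illustration. The coffee_id column is only there so you can check that no row was drawn twice.

```r
library(dplyr)

# Simulated stand-in for the population dataset (hypothetical names and values).
set.seed(2024)
n_coffees <- 1138
pts_vs_flavor_pop <- tibble(
  coffee_id = seq_len(n_coffees),
  total_cup_points = round(runif(n_coffees, min = 60, max = 95), 2),
  flavor = round(runif(n_coffees, min = 5, max = 10), 2)
)

# Draw ten random rows. The default is replace = FALSE,
# so the same row can't be drawn more than once.
pts_vs_flavor_samp <- pts_vs_flavor_pop %>%
  slice_sample(n = 10)

# Both counts should be ten: ten rows, all with different ids.
nrow(pts_vs_flavor_samp)
n_distinct(pts_vs_flavor_samp$coffee_id)
```

Calling set.seed before sampling makes the random draw reproducible, which is handy when you want to rerun an analysis and get the same sample.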
9. Base-R sampling
slice_sample is great for sampling data frames. It's built on top of a base-R function called sample, which works with vectors.
Here, using dollar subsetting retrieves the cup points column as a vector, and the size argument specifies how many random values to return.
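A sketch of the base-R version might look like this. Again, the data frame and the total_cup_points column name are assumed, and the values are simulated.

```r
# Simulated stand-in for the population dataset (hypothetical names and values).
set.seed(2024)
pts_vs_flavor_pop <- data.frame(
  total_cup_points = round(runif(1138, min = 60, max = 95), 2)
)

# Dollar subsetting pulls the column out as a plain numeric vector.
cup_points_pop <- pts_vs_flavor_pop$total_cup_points

# sample() draws `size` values from the vector, without replacement by default.
cup_points_samp <- sample(cup_points_pop, size = 10)

cup_points_samp
```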
10. Population parameters & point estimates
A population parameter is a calculation made on the population dataset. We don't have to just count values; here we calculate the mean of the cup points.
By contrast, a point estimate, also called a sample statistic, is a calculation based on the sample dataset. Here, the mean of the cup points is calculated on the sample.
Notice that the means are almost the same, but not identical.
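To make the comparison concrete, here's a small sketch that calculates both means. The cup points are simulated from a normal distribution, which is purely an assumption for illustration, so the numbers won't match the real coffee data.

```r
# Simulated cup points standing in for the population (hypothetical values).
set.seed(2024)
cup_points_pop <- round(rnorm(1138, mean = 82, sd = 3), 2)

# A random sample of ten values from that population.
cup_points_samp <- sample(cup_points_pop, size = 10)

# Population parameter: the mean calculated on every value.
pop_mean <- mean(cup_points_pop)

# Point estimate: the same calculation on the sample only.
samp_mean <- mean(cup_points_samp)

# The two are usually close, and the gap changes with each new sample.
c(population = pop_mean, sample = samp_mean, difference = samp_mean - pop_mean)
```

With only ten values the point estimate can drift noticeably from the population mean. Try changing size to see how the gap tends to behave with larger samples.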
11. Point estimates with dplyr
Working with data frames is often easier than working with vectors. These mean calculations can be performed using dplyr's summarize function.
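The data frame version could be sketched like this. The data is simulated, and the names pts_vs_flavor_pop, pts_vs_flavor_samp, total_cup_points, flavor, and mean_points are assumptions for illustration.

```r
library(dplyr)

# Simulated stand-in for the population dataset (hypothetical names and values).
set.seed(2024)
pts_vs_flavor_pop <- tibble(
  total_cup_points = round(rnorm(1138, mean = 82, sd = 3), 2),
  flavor = round(runif(1138, min = 5, max = 10), 2)
)

# A ten row random sample of the population.
pts_vs_flavor_samp <- pts_vs_flavor_pop %>%
  slice_sample(n = 10)

# Population parameter, returned as a one-row data frame.
pop_summary <- pts_vs_flavor_pop %>%
  summarize(mean_points = mean(total_cup_points))

# Point estimate, calculated the same way on the sample.
samp_summary <- pts_vs_flavor_samp %>%
  summarize(mean_points = mean(total_cup_points))

pop_summary
samp_summary
```

Because summarize returns a data frame rather than a bare number, the result slots straight into a longer dplyr pipeline.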
12. Let's practice!
Let's get started.