1. Convenience sampling
The point estimates you calculated in the previous exercises were very close to the population parameters that they were based on. You might wonder if that will always be the case.
2. The Literary Digest election prediction
In nineteen thirty six, a magazine called The Literary Digest ran an extensive poll to try to predict the next US presidential election. They contacted ten million voters and received over two million responses.
About one-point-three million people said they would vote for Landon, and just under one million people said they would vote for Roosevelt. That is, Landon was predicted to get fifty seven percent of the vote and Roosevelt was predicted to get forty three percent of the vote. Since the sample size was so large, it was presumed that this poll would be very accurate.
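The predicted percentages are simply each candidate's share of the responses. As a rough check in R, using rounded counts (one million stands in for "just under one million", so the result is approximate):

```r
# Approximate response counts from the poll (rounded, so treat as rough)
landon_votes <- 1.3e6
roosevelt_votes <- 1.0e6  # "just under one million", rounded up here

total_votes <- landon_votes + roosevelt_votes

# Point estimates of each candidate's vote share, in percent
landon_pct <- 100 * landon_votes / total_votes
roosevelt_pct <- 100 * roosevelt_votes / total_votes

# Round to whole percentages for comparison with the quoted figures
round(c(Landon = landon_pct, Roosevelt = roosevelt_pct))
```

With these rounded inputs, the shares round to the fifty seven and forty three percent quoted above.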
However, in the election, Roosevelt won by a landslide with sixty-two percent of the vote.
So what went wrong? Well, in nineteen thirty six, telephones were a luxury, so the only people who had been contacted by The Literary Digest were relatively rich.
The sample of voters was not representative of the whole population of voters, and so the poll suffered from sample bias, also known as selection bias.
The data was collected by the easiest method, in this case, telephoning people. This is called convenience sampling, and it is often prone to sample bias. You need to think carefully about your data collection process to get unbiased results.
3. Finding the mean age of French people
Let's look at another example. While on vacation at Disneyland Paris, you start wondering about the mean age of French people. To get an answer, you ask ten people standing nearby about their ages.
Their mean age is twenty four point six years. Do you think this will be a good estimate of the mean age of all French citizens?
4. How accurate was the survey?
On the left you can see mean ages taken from the French census. Notice that the population has been gradually getting older as birth rates decrease and life expectancy increases.
In twenty fifteen the mean age was over forty, so our estimate of twenty four point six is way off.
The problem is that the family-friendly fun at Disneyland means the sample ages weren't representative of the general population. There are more eight-year-olds than eighty-year-olds riding rollercoasters.
5. Convenience sampling coffee ratings
Let's return to the coffee ratings dataset, and look at the mean cup points population parameter. The mean is about eighty two.
One form of convenience sampling here would be to take the first ten rows, rather than the random rows you saw in the previous video. You can take the first rows with dplyr's slice_head, or with head from base R.
The mean cup points from this sample is higher at eighty nine. The discrepancy suggests that coffees with higher points appear near the start of the dataset. Again, the convenience sample isn't representative of the whole population.
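Here is a minimal sketch of that comparison. The real coffee ratings data isn't loaded here, so the data frame `coffee_demo` and its column `total_cup_points` are hypothetical stand-ins: simulated scores, sorted so the highest come first, to mimic the ordering suggested above.

```r
library(dplyr)

set.seed(123)

# Hypothetical stand-in for the coffee ratings data: 1000 simulated scores,
# sorted so that the highest-rated coffees appear at the top
coffee_demo <- tibble(
  total_cup_points = rnorm(1000, mean = 82, sd = 2.5)
) %>%
  arrange(desc(total_cup_points))

# Population parameter: the mean over every row
pop_mean <- mean(coffee_demo$total_cup_points)

# Convenience sample: the first 10 rows
# (base R equivalent: head(coffee_demo, 10))
first_ten <- coffee_demo %>% slice_head(n = 10)
convenience_mean <- mean(first_ten$total_cup_points)

# Simple random sample of 10 rows, for comparison
random_ten <- coffee_demo %>% slice_sample(n = 10)
random_mean <- mean(random_ten$total_cup_points)

c(population = pop_mean, convenience = convenience_mean, random = random_mean)
```

Because the simulated rows are sorted from highest to lowest, the first ten rows always overestimate the population mean. The random sample has no such built-in direction of error.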
6. Visualizing selection bias
Histograms are a great way to visualize the selection bias.
Here's a histogram of the cup points from the population, with values ranging from just under sixty to just over ninety.
Here's the same histogram from the sample. I've set the xlim values to make the x-axis the same as in the previous plot. Notice how all the values are on the right-hand side: the distribution in the sample is not the same as the distribution in the population.
7. Visualizing selection bias 2
For comparison, here's how the histograms look when random sampling is used. Notice how the shape of the distributions is more closely aligned.
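The three histograms could be drawn along these lines with ggplot2. This is a sketch on simulated data, not the plotting code from the slides: `coffee_demo`, `total_cup_points`, the bin width, and the axis limits are all assumptions chosen for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(123)

# Hypothetical stand-in for the coffee ratings data, highest scores first
coffee_demo <- tibble(
  total_cup_points = rnorm(1000, mean = 82, sd = 2.5)
) %>%
  arrange(desc(total_cup_points))

first_ten <- slice_head(coffee_demo, n = 10)     # convenience sample
random_ten <- slice_sample(coffee_demo, n = 10)  # simple random sample

# Shared x-axis limits (assumed values) so the plots are directly comparable
x_limits <- c(59, 93)

# Helper that draws one histogram of cup points on the shared x-axis
plot_points <- function(data, title) {
  ggplot(data, aes(x = total_cup_points)) +
    geom_histogram(binwidth = 2) +
    xlim(x_limits) +
    ggtitle(title)
}

plot_population <- plot_points(coffee_demo, "Population")
plot_convenience <- plot_points(first_ten, "Convenience sample (first 10 rows)")
plot_random <- plot_points(random_ten, "Simple random sample (10 rows)")

print(plot_population)
print(plot_convenience)
print(plot_random)
```

Fixing the same limits on every plot matters. Without it, each histogram would rescale its x-axis to its own data, which hides the fact that the convenience sample sits entirely at the high end.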
8. Let's practice!
Let's plot some histograms!