1. Convenience sampling
The point estimates you calculated in the previous exercises were very close to the population parameters that they were based on, but is this always the case?
2. The Literary Digest election prediction
In 1936, a magazine called The Literary Digest ran an extensive poll to try to predict the next US presidential election. They phoned ten million voters and received over two million responses.
About one-point-three million people said they would vote for Landon, and just under one million people said they would vote for Roosevelt. That is, Landon was predicted to get fifty-seven percent of the vote, and Roosevelt was predicted to get forty-three percent of the vote. Since the sample size was so large, it was presumed that this poll would be very accurate.
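As a rough check of those percentages, here is a minimal sketch using the rounded figures quoted above. Treating "just under one million" as exactly one million is an assumption made only for this arithmetic, and the variable names are illustrative.

```python
# Rounded response counts from the poll as quoted above.
# Assumption: "just under one million" is treated as 1.0 million here.
landon_votes = 1_300_000
roosevelt_votes = 1_000_000

total_responses = landon_votes + roosevelt_votes

# Each candidate's share of the responses that named one of the two.
landon_share = landon_votes / total_responses
roosevelt_share = roosevelt_votes / total_responses

print(f"Landon: {landon_share:.1%}, Roosevelt: {roosevelt_share:.1%}")
```

With these rounded inputs the shares land close to the fifty-seven and forty-three percent quoted in the prediction.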
However, in the election, Roosevelt won by a landslide with sixty-two percent of the vote.
So what went wrong? Well, in 1936, telephones were a luxury, so the only people who had been contacted by The Literary Digest were relatively rich.
The sample of voters was not representative of the whole population of voters, and so the poll suffered from sample bias.
The data was collected by the easiest method, in this case, telephoning people. This is called convenience sampling and is often prone to sample bias. Before sampling, we need to think about our data collection process to avoid biased results.
3. Finding the mean age of French people
Let's look at another example. While on vacation at Disneyland Paris, you start wondering about the mean age of French people. To get an answer, you ask ten people standing nearby about their ages.
Their mean age is twenty-four-point-six years old. Do you think this will be a good estimate of the mean age of all French citizens?
4. How accurate was the survey?
On the left, you can see mean ages taken from the French census. Notice that the population has been gradually getting older as birth rates decrease and life expectancy increases.
In 2015, the mean age was over forty, so our estimate of twenty-four-point-six is way off.
The problem is that the family-friendly fun at Disneyland means that the sample ages weren't representative of the general population. There are generally more eight-year-olds than eighty-year-olds riding rollercoasters.
5. Convenience sampling coffee ratings
Let's return to the coffee ratings dataset and look at the mean cup points population parameter. The mean is about eighty-two.
One form of convenience sampling would be to take the first ten rows, rather than the random rows we saw in the previous video. We can take the first 10 rows with the pandas head method.
The mean cup points from this sample is higher at eighty-nine. The discrepancy suggests that coffees with higher cup points appear near the start of the dataset. Again, the convenience sample isn't representative of the whole population.
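The comparison can be sketched as follows. The real coffee ratings file isn't loaded here, so this example builds a hypothetical stand-in DataFrame with a total_cup_points column, sorted from highest to lowest to mimic a dataset whose best coffees come first. The synthetic scores and the variable names are assumptions, so the printed means won't match the figures quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the coffee ratings data: 1,000 synthetic scores,
# sorted in descending order so the highest-rated coffees appear first.
rng = np.random.default_rng(seed=42)
scores = np.clip(rng.normal(loc=82, scale=3, size=1000), 59, 91)
coffee_ratings = pd.DataFrame({"total_cup_points": np.sort(scores)[::-1]})

# Population parameter: the mean over every row.
population_mean = coffee_ratings["total_cup_points"].mean()

# Convenience sample: simply the first 10 rows.
coffee_ratings_first10 = coffee_ratings.head(10)
convenience_mean = coffee_ratings_first10["total_cup_points"].mean()

print(f"Population mean:         {population_mean:.2f}")
print(f"Convenience sample mean: {convenience_mean:.2f}")
```

Because the rows are ordered by score, the first ten rows overestimate the population mean, which is the same pattern described above.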
6. Visualizing selection bias
Histograms are a great way to visualize the selection bias.
We can create a histogram of the total cup points from the population, which contains values ranging from around 59 to around 91. The numpy-dot-arange function can be used to create bins of width 2 from 59 to 91. Recall that the stop value in numpy-dot-arange is exclusive, so we specify 93, not 91, to make sure 91 is included as the last bin edge.
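To see why the stop value matters, here is a small sketch that builds the bin edges both ways. The variable names are illustrative.

```python
import numpy as np

# The stop value is exclusive, so stopping at 91 leaves 91 out of the edges.
bins_too_short = np.arange(59, 91, 2)

# Stopping at 93 keeps 91 as the final edge.
bins = np.arange(59, 93, 2)

print(bins_too_short[-1])  # 89
print(bins[-1])            # 91
print(len(bins) - 1)       # 16 bins of width 2
```

An array of edges like this can be passed as the bins argument when drawing the histogram.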
Here's the same code to generate a histogram for the convenience sample.
7. Distribution of a population and of a convenience sample
Comparing the two histograms, it is clear that the distribution of the sample is not the same as the population: all of the sample values are on the right-hand side of the plot.
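A side-by-side plot of the two distributions might look like the sketch below. It uses a hypothetical, synthetic stand-in for the coffee ratings data (sorted so the highest scores come first), and the output filename is an assumption. The non-interactive Agg backend is chosen only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the coffee ratings data, highest scores first.
rng = np.random.default_rng(seed=42)
scores = np.clip(rng.normal(loc=82, scale=3, size=1000), 59, 91)
coffee_ratings = pd.DataFrame({"total_cup_points": np.sort(scores)[::-1]})

# Convenience sample: the first 10 rows.
coffee_ratings_first10 = coffee_ratings.head(10)

# Shared bin edges of width 2 from 59 to 91 (the stop value is exclusive).
bins = np.arange(59, 93, 2)

# One panel per distribution, sharing the x-axis so positions are comparable.
fig, (ax_pop, ax_sample) = plt.subplots(1, 2, sharex=True, figsize=(10, 4))
coffee_ratings["total_cup_points"].hist(bins=bins, ax=ax_pop)
ax_pop.set_title("Population")
coffee_ratings_first10["total_cup_points"].hist(bins=bins, ax=ax_sample)
ax_sample.set_title("Convenience sample (first 10 rows)")

# Save to a file (assumed name); in an interactive session, plt.show() works too.
fig.savefig("population_vs_convenience_sample.png")
```

Using the same bins and a shared x-axis for both panels makes it easier to see that the sample values sit at the right-hand end of the population's range.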
8. Visualizing selection bias for a random sample
This time, we'll compare the total_cup_points distribution of the population with a random sample of 10 coffees.
9. Distribution of a population and of a simple random sample
Notice how the shape of the distributions is more closely aligned when random sampling is used.
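The same comparison can be made numerically by counting how many sample values fall in each bin. This is a sketch on a hypothetical, synthetic stand-in for the coffee ratings data; the helper bin_counts and the random_state value are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the coffee ratings data, highest scores first.
rng = np.random.default_rng(seed=42)
scores = np.clip(rng.normal(loc=82, scale=3, size=1000), 59, 91)
coffee_ratings = pd.DataFrame({"total_cup_points": np.sort(scores)[::-1]})

# Bin edges of width 2 from 59 to 91 (the stop value is exclusive).
bins = np.arange(59, 93, 2)

# Convenience sample (first 10 rows) versus a simple random sample of 10 rows.
coffee_ratings_first10 = coffee_ratings.head(10)
coffee_ratings_random10 = coffee_ratings.sample(n=10, random_state=42)


def bin_counts(df):
    """Hypothetical helper: count the total_cup_points values in each bin."""
    counts, _ = np.histogram(df["total_cup_points"], bins=bins)
    return counts


population_counts = bin_counts(coffee_ratings)
first10_counts = bin_counts(coffee_ratings_first10)
random10_counts = bin_counts(coffee_ratings_random10)

# Print the counts per bin, labelled by each bin's left edge.
summary = pd.DataFrame(
    {
        "population": population_counts,
        "first_10": first10_counts,
        "random_10": random10_counts,
    },
    index=bins[:-1],
)
print(summary)
```

With only ten values the random sample is still noisy, but its counts are drawn from across the population's range rather than from the top end alone.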
10. Let's practice!
Let's plot some histograms!