Sampling and bias

1. Sampling and bias

We saw that inference is directly influenced by the sample chosen. This means that our choice of a sample is fundamental in making principled, repeatable, and valid inference. The main concern when selecting a sample is bias.

2. Bias

A biased sample is one in which some group occurs far more or far less often than it does in the population. Using a biased sample in inference means we will draw conclusions from a sample that does not look like our population.

3. Biased samples

Consider the salaries of employees at a company. Suppose we took a sample of our most trusted co-workers and found a mean salary of 96,000 dollars. Is this a reasonable estimate of the population statistic, namely the mean salary of all employees at the company? It's impossible to know for sure from this one sample, but analyzing repeated samples can help us.

4. Sampling distribution

We'll ask HR for the mean salary of ten randomly selected employees. We'll repeat this process one hundred times and store each of these means. The collection of all of these sample means is referred to as a sampling distribution, as it shows how the sample mean varies from one random sample to the next. We'll then plot this sampling distribution as a histogram. Note that the x-axis is the mean salary from each of our samples.
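As a rough sketch of this process, we can simulate it. We have no access to the real HR records here, so the population below is synthetic, and the names `population` and `sampling_distribution` as well as the salary values are assumptions made only for illustration. A simple text histogram stands in for the plot.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 synthetic salaries (not real company data).
population = [random.gauss(83_000, 12_000) for _ in range(1_000)]

def sampling_distribution(population, sample_size=10, n_samples=100):
    """Return the mean of each of n_samples simple random samples."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]

sample_means = sampling_distribution(population)

# Crude text histogram: bucket the sample means into 5,000-dollar bins.
bin_width = 5_000
counts = {}
for m in sample_means:
    left_edge = int(m // bin_width) * bin_width
    counts[left_edge] = counts.get(left_edge, 0) + 1

for left_edge in sorted(counts):
    print(f"{left_edge:>7}-{left_edge + bin_width:<7} {'#' * counts[left_edge]}")
```

Each row of the printout is one bin of mean salaries, so the longest rows show where most of the sample means fall.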

5. Sampling distribution

Here we see that, in most of these samples, the mean salary was around eighty-three thousand dollars. Recall that the mean salary in the sample of just our friends was ninety-six thousand dollars, which seems suspiciously high. Perhaps our friends are high earners, in which case they would be a biased sample of all employees at the company.
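One way to put a number on "suspiciously high" is to ask how often a random sample produces a mean at least as large as the one from our friends. The sketch below does this on a synthetic population, so the salary values and variable names are assumptions for illustration, not the company's data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of synthetic salaries; stands in for HR's records.
population = [random.gauss(83_000, 12_000) for _ in range(1_000)]

# Sampling distribution: means of 100 random samples of 10 employees each.
sample_means = [
    statistics.mean(random.sample(population, 10)) for _ in range(100)
]

friends_mean = 96_000  # the mean from our hand-picked sample of friends

# Share of random-sample means that are at least as large as the friends' mean.
share_at_least_as_high = (
    sum(m >= friends_mean for m in sample_means) / len(sample_means)
)

print(f"Median of the random-sample means: {statistics.median(sample_means):,.0f}")
print(f"Share of random samples with mean >= {friends_mean:,}: "
      f"{share_at_least_as_high:.2f}")
```

A small share suggests that a mean like our friends' would rarely come from a random sample, which is consistent with the friends being a biased sample.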

6. Depends on the sample

So what is and is not affected by our choice of a sample? Samples affect point estimates, and thus affect inference. Whenever we compute a point estimate, the value comes directly from our data, so if we change our data, our point estimate will also change. The same holds for a hypothesis test: the p-value is computed from the data in our sample, so it may change with a different sample. Because we use the result of the hypothesis test to make inferences about our situation, a different sample can potentially yield a completely different conclusion!
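To see this dependence concretely, here is a minimal sketch that draws two random samples from the same synthetic population and computes a point estimate and a p-value for each. The population, the null value of 80,000 dollars, and the helper `z_test_p_value` are all assumptions for illustration. The test uses a normal approximation, which is a simplification and not necessarily the test used elsewhere in the course.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of synthetic salaries (illustrative values only).
population = [random.gauss(83_000, 12_000) for _ in range(1_000)]

def z_test_p_value(sample, null_mean):
    """Two-sided p-value for H0: population mean == null_mean.

    Uses a normal approximation with the sample standard deviation,
    which is a simplifying assumption made for this sketch.
    """
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - null_mean) / standard_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

null_mean = 80_000  # hypothetical null value chosen for illustration

# Two different random samples from the same population.
sample_a = random.sample(population, 30)
sample_b = random.sample(population, 30)

for name, sample in [("A", sample_a), ("B", sample_b)]:
    print(
        f"Sample {name}: mean = {statistics.mean(sample):,.0f}, "
        f"p-value = {z_test_p_value(sample, null_mean):.3f}"
    )
```

Both samples come from the same population, yet their means and p-values differ, so a conclusion drawn from one of them need not match the other.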

7. Doesn't depend on the sample

On the other hand, some things do not depend on our sample. For one, there is some true population statistic which we usually cannot observe directly. This is the value coming from the entire population, such as the percentage of our potential customers who will actually buy the product. The goal of inference is generally to infer what this value may be, and the value itself is not affected by our choice of a sample. We discussed in the previous slide how our sample can affect the calculation of our p-value. However, once we have a p-value, our conclusion is based only on this p-value and our choice of alpha. Therefore, given a p-value, our conclusion no longer depends on the sample.
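The last point can be written as a tiny decision rule. The function name `conclusion` is hypothetical, and rejecting when the p-value is strictly below alpha is an assumed convention (some texts reject when it is less than or equal to alpha). Notice that the function takes no data at all, only the p-value and alpha.

```python
def conclusion(p_value, alpha=0.05):
    """Decision rule that depends only on the p-value and alpha."""
    if not 0 <= p_value <= 1 or not 0 < alpha < 1:
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(conclusion(0.03))        # reject H0
print(conclusion(0.03, 0.01))  # fail to reject H0
```

The same p-value leads to different conclusions under different choices of alpha, but once both are fixed the sample plays no further role.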

8. Let's practice!

Now that we've seen how a biased sample can affect our inference, let's practice!


Foundations of Inference in Python
