Central to Stats: Sampling!

1. Central to Stats: Sampling!

In this chapter we go a level deeper in our statistical knowledge.

2. Lesson Overview

In this section you'll understand sampling versus populations and get a basic understanding of the central limit theorem.

3. So far, you have...

So far, you've calculated descriptive stats

4. So far, you have...

and made visuals about the data.

5. So far, you have...

More specifically, you were using all the data you were presented with.

6. So far, you have...

That means you were working with "populations".

7. What is a population?

In stats, a population is defined as an *entire* distribution of similar observations or events. Suppose you want to know how everyone riding trains feels about a rate hike. You could ask the *entire* population, but this could be costly and time-consuming. As a result, statisticians more often work with data samples.
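To make the idea concrete, here's a minimal sketch in Python using numpy. All the data is made up: we simulate a full "population" of rider opinions (ratings of a fare hike from 1 to 5) and compute population-level statistics directly, since here we happen to have every observation.

```python
import numpy as np

# Hypothetical data: a rating from every single rider (the population),
# where 1 = strongly oppose the fare hike and 5 = strongly support it.
rng = np.random.default_rng(seed=0)
population = rng.integers(1, 6, size=100_000)  # simulated full population

# With the entire population in hand, these are true population stats,
# not estimates.
population_mean = population.mean()
population_std = population.std()  # ddof=0: the population standard deviation
print(f"population mean: {population_mean:.3f}")
print(f"population std:  {population_std:.3f}")
```

In practice you almost never have every observation like this, which is exactly why the next slide turns to sampling.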

8. Sampling to the rescue

To avoid spending a lot of time and effort collecting data for large populations, statisticians have the benefit of *sampling*. A sample is a subset of the population's observations, meant to represent the characteristics of the population. As your sample size increases, the sample statistics will more closely approximate the population's stats. So instead of asking everyone at the train station, you could randomly sample a few dozen or a few hundred riders to approximate how the larger population may feel about raising fares.
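A quick sketch of that claim, using the same kind of simulated rider-opinion data as before (the numbers are invented for illustration): draw random samples of increasing size and watch the sample mean close in on the population mean.

```python
import numpy as np

# Simulated population of rider opinions (1-5 rating of a fare hike).
rng = np.random.default_rng(seed=1)
population = rng.integers(1, 6, size=100_000)

# Draw larger and larger random samples without replacement and compare
# each sample mean to the true population mean.
for n in (10, 100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    error = abs(sample.mean() - population.mean())
    print(f"n={n:>6}: sample mean={sample.mean():.3f}, error={error:.3f}")
```

The exact errors depend on the random seed, but the trend is the point: bigger samples generally land closer to the population value.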

9. Central Limit Theorem (CLT)

In stats, the central limit theorem is important. It states that if you repeatedly draw independent random samples from any distribution, skewed or not, the distribution of the sample means will be approximately normal. Using an appropriate sample size along with the central limit theorem helps overcome the problem of working with data from non-normal populations. The more data that's gathered in a sample, the more certainty exists in the resulting statistics. The approximate normality of the sampling distribution, guaranteed by the central limit theorem, lets you make statistical inferences from the sample to the population, which leads to hypothesis testing, covered later.
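The theorem is easiest to believe by simulation. In this sketch we deliberately pick a heavily skewed distribution (an exponential, which is nothing like rider ratings but makes the skew obvious) and look at the distribution of the sample means:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# An exponential distribution with scale=1 is strongly right-skewed
# and has a true mean of 1.0.
sample_size = 50
num_samples = 5_000

# Draw 5,000 independent samples of 50 points each, then take the
# mean of each sample.
samples = rng.exponential(scale=1.0, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# CLT: the sample means cluster symmetrically around the true mean,
# with spread roughly sigma / sqrt(n) = 1 / sqrt(50).
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")
```

A histogram of `sample_means` would look like the familiar bell curve, even though a histogram of the raw exponential draws would not.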

10. Central Limit Theorem (CLT)

In the previous slide the central limit theorem was summarized as: "If the sample size from an independent, random variable is **large enough**, then the sampling distribution will be normal or nearly normal." This lets us make inferences about the population. Of course, "large enough" is deliberately vague, exactly how statisticians like it! The size is really dictated by two factors. First, exactly how precise do you need to be? You could ask a single train rider for their thoughts on raising fares, but that's probably less accurate than asking 10 or 20 or even more. The second factor is how the population distribution behaves. The more normal the underlying population, the fewer sample data points are needed to make accurate inferences based on the sample. As a practical rule of thumb, if the underlying population is roughly normal, many inferential statisticians would say a sample of 30 to 40 is sufficient to make inferences about the population.
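The precision factor above can be put in numbers. The spread of the sample mean around the population mean, called the standard error, shrinks like 1/sqrt(n). This sketch assumes a made-up population standard deviation of 1.5 for fare-hike opinions:

```python
import numpy as np

# Assumed (hypothetical) spread of opinions in the population.
population_std = 1.5

# Standard error of the sample mean: sigma / sqrt(n).
# Going from 1 rider to 100 riders cuts the error by a factor of 10.
for n in (1, 10, 20, 100):
    standard_error = population_std / np.sqrt(n)
    print(f"n={n:>4}: standard error of the mean = {standard_error:.3f}")
```

Notice the diminishing returns: quadrupling the sample size only halves the standard error, which is part of why modest samples of a few dozen are often considered good enough.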

11. Off to do some sampling!

In the next few exercises you will calculate descriptive statistics for various sample sizes and for a population, along with some histograms. You'll definitely see the central limit theorem at work!