1. Inferential Statistics Concepts
Previously, we found the single best value of each model parameter and used those values to build a model.
In this Chapter, we'll treat a model parameter, like slope, not as a single value but as a "distribution" of values, whose mean gives our "best" value.
We'll use random sampling to estimate these parameter distributions, and use these distributions to make probabilistic statements about both model parameters and model predictions.
This will be significantly more challenging than previous Chapters, and will require the introduction of more advanced statistical tools. But it's well worth the effort, and I'll guide you every step of the way.
2. Probability Distribution
What do we mean by "probability distributions"?
Here's an example: a decade of daily temperatures for August in Austin, Texas.
The MEAN is about 100 degrees Fahrenheit; that's climate.
The SPREAD tells us the temperature varies from day to day; that's weather.
Recall the sea surface temperature exercise. The overall linear trend, modeled by slope, was the "climate", and the variation about that trend was the "weather", seen as residuals.
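As a minimal sketch of this idea (using simulated temperatures, not the course data, with an assumed mean of 100 and spread of 5 degrees), we can summarize the distribution's center as the "climate" and its spread as the "weather":

```python
# Sketch with simulated data: a decade of daily August highs.
import numpy as np

rng = np.random.default_rng(42)
august_temps = rng.normal(loc=100, scale=5, size=31 * 10)  # 10 Augusts, degrees F (simulated)

print("mean (the 'climate'):", np.mean(august_temps))
print("std  (the 'weather'):", np.std(august_temps))
```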
3. Populations and Statistics
We usually don't have the entire "population" of data; what counts as "all the data" will depend on context.
For example, temperatures for every day in a decade, or the height of every person in the world.
We define a "sample" as a subset of the "population".
For example, daily high temperatures for just one month, or heights of people from just one country.
A "statistic" summarizes a distribution. For example, the mean temperature or the median height.
4. Sampling the Population
The population statistics and the sample statistics are not usually the same.
If sample temperatures were all from summer days, the sample mean would be higher than the population mean computed from a full year.
To ensure the sample is REPRESENTATIVE of the population, meaning both have about the same center and spread, we draw points at random from the population using `np.random.choice()`.
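A short sketch of that random draw, using the same simulated population as before (the seed and distribution parameters are arbitrary choices, not from the course):

```python
# Sketch: draw a random, representative sample from the population.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=75, scale=15, size=3650)  # simulated daily temperatures

sample = np.random.choice(population, size=31)  # 31 points drawn at random

print("population mean/std:", np.mean(population), np.std(population))
print("sample mean/std:    ", np.mean(sample), np.std(sample))
```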
5. Visualizing Distributions
To verify that the sample is "representative" of the population, we plot a histogram of both.
Doing so with raw counts is usually not useful because the sample count is often much smaller than the population.
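The sketch below (simulated data again) overlays the raw-count histograms; the sample bars are dwarfed because the sample has far fewer points than the population:

```python
# Sketch: raw-count histograms of sample and population on one plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
population = rng.normal(loc=75, scale=15, size=3650)
sample = np.random.choice(population, size=31)

bins = np.linspace(population.min(), population.max(), 20)
plt.hist(population, bins=bins, alpha=0.5, label='population (counts)')
plt.hist(sample, bins=bins, alpha=0.5, label='sample (counts)')
plt.xlabel('temperature (deg F)')
plt.ylabel('count')
plt.legend()
plt.show()
```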
6. Visualizing Distributions
If we "normalized" both population and sample, dividing sample bins by 31 and population bins by 3650, each distribution sums to 1.0
The difference in normalized distributions becomes clear.
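One way to sketch this normalization (still with simulated data) is to weight every point by one over its data set's size, so each histogram's bin heights sum to 1.0 and the two sit on the same vertical scale:

```python
# Sketch: normalized histograms, each summing to 1.0.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
population = rng.normal(loc=75, scale=15, size=3650)
sample = np.random.choice(population, size=31)

bins = np.linspace(population.min(), population.max(), 20)
plt.hist(population, bins=bins, weights=np.ones(len(population)) / len(population),
         alpha=0.5, label='population (fraction)')
plt.hist(sample, bins=bins, weights=np.ones(len(sample)) / len(sample),
         alpha=0.5, label='sample (fraction)')
plt.xlabel('temperature (deg F)')
plt.ylabel('fraction of days')
plt.legend()
plt.show()
```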
7. Probability and Inference
The shape of a SAMPLE DISTRIBUTION is often used to make "inferences" about the POPULATION DISTRIBUTION.
If we divide each bin count by the total number of days in the data set, the sum across all bins is 1.0. So the sum of the bins to the right of 100 degrees, half of the distribution, would be about 0.5.
From this NORMALIZED distribution, rather than just stating that we have some uncertainty, we can INFER more precisely that there is a 50% probability that any day in August will exceed 100 degrees. This is a new type of prediction.
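As a sketch of this inference (with simulated August temperatures centered on 100 degrees, which is an assumption, not the course data), the fraction of days above 100 plays the role of that probability:

```python
# Sketch: the normalized bin counts act like probabilities, so the fraction
# of days above 100 degrees estimates P(a given August day exceeds 100).
import numpy as np

rng = np.random.default_rng(4)
august_temps = rng.normal(loc=100, scale=5, size=31 * 10)  # simulated August highs

prob_over_100 = np.mean(august_temps > 100)  # sum of the normalized bins right of 100
print("P(temp > 100 F):", prob_over_100)
```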
8. Visualizing Distributions
If we use `np.random.choice()` again, to take a second sample, we see that the two samples differ. What if we "resample" 100 times?
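A quick sketch of that sampling variation, again on the simulated population: two independent calls to `np.random.choice()` give two different sample means.

```python
# Sketch: two samples drawn from the same population usually differ.
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(loc=75, scale=15, size=3650)

sample_1 = np.random.choice(population, size=31)
sample_2 = np.random.choice(population, size=31)

print("sample 1 mean:", np.mean(sample_1))
print("sample 2 mean:", np.mean(sample_2))
```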
9. Resampling
Think of any measured data set as a sample randomly drawn from a larger population. If you collected another data set, it would be different.
Resampling is used to quantify this variation and infer variation in the population.
Here we take 20 samples and compute the sample mean each time. If each sample were a road trip, we'd have a distribution of 20 mean speeds.
To characterize the shape of this distribution, we take the mean and standard deviation.
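Below is a sketch of that resampling loop; the speeds, sample size, and population size are illustrative stand-ins, not the course's road-trip data:

```python
# Sketch: draw 20 samples, compute each sample's mean, then summarize the
# distribution of those 20 means by its own mean and standard deviation.
import numpy as np

rng = np.random.default_rng(6)
population_speeds = rng.normal(loc=60, scale=8, size=10000)  # simulated trip speeds

num_samples, sample_size = 20, 50
sample_means = np.array([
    np.mean(np.random.choice(population_speeds, size=sample_size))
    for _ in range(num_samples)
])

print("mean of the 20 sample means:", np.mean(sample_means))
print("std  of the 20 sample means:", np.std(sample_means))
```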
10. Let's practice!
Now it's your turn to practice sampling with Python.