1. Confidence intervals
In the last few exercises, you looked at relationships between the sampling distribution and the bootstrap distribution.
2. Confidence intervals
One way to summarize these distributions is to look at the values within one standard deviation of the mean, which give a good sense of where most of the values in a distribution lie.
In this final lesson, we'll formalize the idea of values close to a statistic by defining the term "confidence interval".
3. Predicting the weather
Consider a meteorologist predicting weather in one of the world's most unpredictable regions: the northern Great Plains of the US and Canada.
Rapid City, South Dakota was ranked as the least predictable of the 120 US cities with a National Weather Service forecast office.
Suppose you've taken a job as a meteorologist at a news station in Rapid City. Your job is to predict tomorrow's high temperature.
4. Your weather prediction
You analyze the weather data using the best meteorological forecasting tools available to you and predict a high temperature of 47 degrees Fahrenheit.
47 degrees is your point estimate.
Since the weather is variable, and many South Dakotans will plan their day tomorrow based on your forecast, you'd like to present a range of plausible values for the high temperature instead.
On your weather show, you report that the high temperature will likely be between forty and fifty-four degrees tomorrow.
5. You just reported a confidence interval
This prediction of forty to fifty-four degrees can be thought of as a confidence interval for the unknown quantity of tomorrow's high temperature. That is, although you can't be sure of the exact temperature, you are confident that it will be in that range.
Results are often written as the point estimate followed by the confidence interval's lower and upper bounds in parentheses or square brackets, for example 47 °F (40 °F, 54 °F).
When the confidence interval is symmetric around the point estimate, an alternate way of presenting results is by giving the point estimate plus or minus a value, for example 47 °F ± 7 °F.
Here, seven degrees is the margin of error: the distance from the point estimate to either bound.
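The arithmetic behind the two ways of writing the forecast can be sketched in a few lines of Python. The numbers are the ones from the forecast example; the variable names are only illustrative.

```python
# Bounds of the forecast interval from the weather example.
lower, upper = 40, 54

# For a symmetric interval, the point estimate is the midpoint
# and the margin of error is half the interval width.
point_estimate = (lower + upper) / 2
margin_of_error = (upper - lower) / 2

print(f"{point_estimate:g} ({lower}, {upper})")     # prints: 47 (40, 54)
print(f"{point_estimate:g} +/- {margin_of_error:g}")  # prints: 47 +/- 7
```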
6. Bootstrap distribution of mean flavor
Here's the bootstrap distribution of mean flavor from the coffee dataset.
7. Mean of the resamples
We can calculate the mean of the resampled mean flavors.
8. Mean plus or minus one standard deviation
If we use a confidence interval of one standard deviation on either side of the mean, notice that it doesn't cover a lot of the distribution. There are lots of values in the bootstrap distribution not covered by this confidence interval.
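To see how much of a bootstrap distribution a one-standard-deviation interval leaves out, here is a minimal Python sketch. It assumes simulated scores as a stand-in for the coffee data, which is not reproduced here, so `flavor_sample` and its parameters are hypothetical. For a roughly bell-shaped distribution, about 68 percent of the values should fall inside the interval.

```python
import random
import statistics

random.seed(42)

# Hypothetical stand-in for the coffee flavor scores (not the real dataset).
flavor_sample = [random.gauss(7.5, 0.3) for _ in range(500)]

# Bootstrap distribution: the mean of each resample drawn with replacement.
boot_means = [
    statistics.fmean(random.choices(flavor_sample, k=len(flavor_sample)))
    for _ in range(2000)
]

center = statistics.fmean(boot_means)
spread = statistics.stdev(boot_means)

# Fraction of bootstrap means inside mean +/- one standard deviation.
inside = sum(center - spread <= m <= center + spread for m in boot_means)
coverage = inside / len(boot_means)
print(f"share covered by mean +/- 1 sd: {coverage:.2f}")
```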
9. Quantile method for confidence intervals
If we want to include, say, ninety-five percent of the values in the confidence interval, we can use quantiles of the values. To get the middle ninety-five percent, we go from the point-zero-two-five quantile to the point-nine-seven-five quantile, since the difference between those numbers is point-nine-five.
To calculate the lower and upper bounds for the confidence interval, we call quantile, passing the bootstrap distribution values and those two quantile levels. The confidence interval is from seven point five to seven point five four.
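The quantile method can be sketched in Python with the standard library. The lesson itself calls a function named quantile; `statistics.quantiles` is used here as a rough counterpart. The data is simulated and hypothetical, so the bounds it prints will not match the coffee numbers quoted above.

```python
import random
import statistics

random.seed(42)

# Hypothetical stand-in for the coffee flavor scores (not the real dataset).
flavor_sample = [random.gauss(7.5, 0.3) for _ in range(500)]

# Bootstrap distribution of the mean.
boot_means = [
    statistics.fmean(random.choices(flavor_sample, k=len(flavor_sample)))
    for _ in range(2000)
]

# n=40 places cut points at multiples of 1/40 = 0.025, so the first and
# last cut points are the 0.025 and 0.975 quantiles.
cuts = statistics.quantiles(boot_means, n=40, method="inclusive")
lower, upper = cuts[0], cuts[-1]

print(f"95% CI (quantile method): ({lower:.3f}, {upper:.3f})")
```

By construction, about ninety-five percent of the bootstrap means lie between `lower` and `upper`.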
10. Inverse cumulative distribution function
There is a second method to calculate confidence intervals. To understand it, you need to be familiar with the normal distribution's inverse cumulative distribution function.
The bell curve is the probability density function, or PDF. Using calculus, if you integrate this, you get the cumulative distribution function, or CDF. Flip the x and y axes to get the inverse CDF.
qnorm takes a probability between zero and one, and returns the value of the normal distribution at that quantile. Notice that, for the standard normal distribution, the values corresponding to point-zero-two-five and point-nine-seven-five are about minus and plus two (more precisely, about minus and plus one point nine six).
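As an illustration, Python's standard library offers `NormalDist.inv_cdf`, which plays a similar role to qnorm. The sketch below evaluates the inverse CDF of the standard normal distribution at the two probabilities mentioned above and checks that it undoes the CDF.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# The inverse CDF maps a probability to the value below which
# that share of the distribution lies.
z_lower = standard_normal.inv_cdf(0.025)
z_upper = standard_normal.inv_cdf(0.975)
print(round(z_lower, 2), round(z_upper, 2))  # prints: -1.96 1.96

# Applying the CDF to the result recovers the original probability.
print(round(standard_normal.cdf(z_upper), 3))  # prints: 0.975
```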
11. Standard error method for confidence interval
This second method for calculating a confidence interval is called the standard error method.
First, you calculate the point estimate, which is the mean of the bootstrap distribution, and the standard error, which is the standard deviation of the bootstrap distribution.
Then you call qnorm at point-zero-two-five and point-nine-seven-five to get the inverse CDF of the normal distribution with the same mean and standard deviation as the bootstrap distribution. The two results are the lower and upper bounds. Again, the confidence interval is from seven point five to seven point five four, though the numbers differ slightly from last time since our bootstrap distribution isn't perfectly normal.
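The standard error method can be sketched in Python along the same lines, with `NormalDist.inv_cdf` from the standard library standing in for qnorm. The data is simulated and hypothetical, so the printed bounds will not match the coffee numbers quoted above.

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

# Hypothetical stand-in for the coffee flavor scores (not the real dataset).
flavor_sample = [random.gauss(7.5, 0.3) for _ in range(500)]

# Bootstrap distribution of the mean.
boot_means = [
    statistics.fmean(random.choices(flavor_sample, k=len(flavor_sample)))
    for _ in range(2000)
]

# Point estimate and standard error, both taken from the bootstrap distribution.
point_estimate = statistics.fmean(boot_means)
std_error = statistics.stdev(boot_means)

# Normal distribution with the same mean and standard deviation,
# evaluated at the 0.025 and 0.975 quantiles.
fitted = NormalDist(mu=point_estimate, sigma=std_error)
lower = fitted.inv_cdf(0.025)
upper = fitted.inv_cdf(0.975)

print(f"95% CI (standard error method): ({lower:.3f}, {upper:.3f})")
```

Because the fitted normal is symmetric, this interval is the point estimate plus or minus roughly 1.96 standard errors.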
12. Let's practice!
Let's calculate some confidence intervals!