1. Welcome to the course!
Hello, and welcome to Inference for Numerical Data!
My name is Mine Cetinkaya-Rundel, and in this course you will learn concepts that are essential for conducting inference on numerical data and the associated R code for doing so.
We'll begin with using bootstrapping techniques to conduct inference on a single parameter of a numerical distribution.
Let's get to it!
2. Rent in Manhattan
On a given day, 20 one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as "by owner", as opposed to by a rental agency. First, let's take a look at the distribution of these rents. The distribution is unimodal and right skewed.
So, is the mean or the median a better measure of typical rent in Manhattan?
Since the distribution is right skewed, the median is a better measure of typical rent: a few very high rents pull the mean upward, while the median is much less affected by them.
3. Bootstrapping techniques
Assuming that this sample is representative of the population of all one-bedroom apartments in Manhattan, which is a bit unlikely since these data come from only one classifieds website, we can use bootstrapping techniques to estimate the median rental price of one-bedroom apartments in Manhattan.
Remember that the term bootstrapping comes from the phrase pulling oneself up by one's bootstraps, which is a metaphor for accomplishing an impossible task without any outside help. In this case the impossible task is estimating the population parameter using data from only the given sample. Note that this is what statistical inference is all about -- we have a sample, and we use that sample to make inferences about the unknown population.
4. Observed sample
Here's our original sample of 20 apartments and their rents. The sample median is two thousand three hundred and fifty dollars.
Using this sample, we want to estimate the population median and we will do so via bootstrapping. Remember, in bootstrapping we take random samples from the original sample with replacement.
5. Bootstrap population
We sample with replacement because we believe that for every observation in the sample, there are more like it in the population. So we can think of our bootstrap population as a population where each observation from the sample appears many times. And then we take many samples from this population to understand what medians of samples from the original population would look like, if in fact we had the resources to take many samples from the population.
6. Bootstrapping scheme
How does this work in practice?
We first take a bootstrap sample: a random sample taken with replacement from the original sample, and of the same size as the original sample.
Then we calculate the bootstrap statistic for this sample. Remember in this example we're interested in the median, but we could use the same scheme for a mean, a proportion, a standard deviation, a slope, etc.
We repeat steps one and two many times to create the bootstrap distribution. This is a distribution of bootstrap statistics. This is actually just like creating the sampling distribution, but there is one big difference: in bootstrapping we are taking samples from the original sample instead of from the population.
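To make the three steps concrete, here is a minimal base R sketch. The rents below are simulated placeholder values, not the Craigslist sample, and the number of repetitions (1000) is an arbitrary choice for illustration.

```r
# Placeholder data: 20 simulated right-skewed "rents" (NOT the Craigslist sample).
set.seed(2024)
rent <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.35))

n_reps <- 1000  # number of bootstrap samples (arbitrary, for illustration)

# Steps 1 and 2, repeated n_reps times:
#   1. take a sample with replacement, the same size as the original sample
#   2. calculate the bootstrap statistic (here, the median) for that sample
boot_medians <- replicate(n_reps, {
  boot_sample <- sample(rent, size = length(rent), replace = TRUE)
  median(boot_sample)
})

# Step 3: the collection of bootstrap medians is the bootstrap distribution.
summary(boot_medians)
```

Because every bootstrap sample is drawn from the original 20 values, each bootstrap median has to fall between the smallest and largest observed rent.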
7. Bootstrapping scheme, in R
Next, let's discuss implementing bootstrapping in R using the infer package. We can construct the bootstrap distribution in one pipe.
We start with our data frame, and first specify the variable of interest, which in this case is rent.
8. Bootstrapping scheme, in R
Then we generate bootstrap samples, many many of them!
9. Bootstrapping scheme, in R
And finally we calculate the statistic in each one of these samples, which in this case is the median.
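Put together, the pipe might look like the sketch below. The data frame name `manhattan_sim`, its simulated values, and the choice of 1000 repetitions are assumptions made for illustration; only the `rent` variable and the median come from the narration. The native pipe `|>` assumes R 4.1 or later.

```r
library(infer)

# Placeholder data frame with a `rent` column (simulated; NOT the course data).
set.seed(2024)
manhattan_sim <- data.frame(
  rent = round(rlnorm(20, meanlog = log(2400), sdlog = 0.35))
)

rent_med_boot <- manhattan_sim |>
  specify(response = rent) |>                  # variable of interest
  generate(reps = 1000, type = "bootstrap") |> # many bootstrap samples
  calculate(stat = "median")                   # median of each bootstrap sample

# One row per bootstrap sample; the medians are in the `stat` column.
head(rent_med_boot)
```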
10. Constructing the bootstrap interval
The result is the bootstrap distribution
11. Constructing the bootstrap interval
and using this distribution we determine the bounds of the confidence interval.
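The narration does not say which method is used to get the bounds, so the following is only one possibility: a percentile interval, where the bounds of a 95% interval are the 2.5th and 97.5th percentiles of the bootstrap distribution. The data are simulated placeholders and the 95% level is an assumption.

```r
# Placeholder data: 20 simulated right-skewed "rents" (NOT the Craigslist sample).
set.seed(2024)
rent <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.35))

# Bootstrap distribution of the median (1000 repetitions, chosen arbitrarily).
boot_medians <- replicate(1000, median(sample(rent, replace = TRUE)))

# Percentile method: cut off the lowest and highest 2.5% of bootstrap medians.
ci_percentile <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_percentile
```

With a bootstrap distribution produced by infer, `get_confidence_interval()` with `type = "percentile"` should give the same kind of interval.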
12. Let's practice!
Now let's put what you've learned so far to use.