1. Confidence intervals
In this chapter, we'll focus on statistical experiments and hypothesis testing - a crucial element of any statistics interview for data scientists. We'll start things off simple with confidence intervals and hypothesis testing before we move into more complex topics like multiple testing and the trade-off between power and sample size.
2. Intro to sampling
Before we dive into confidence intervals, let's review the concept of sampling. A sample is a collection of data from a certain population that is meant to represent the whole. As we see here, it will usually make up only a small portion of the total. The idea is that we can draw conclusions from the sample and generalize them to the broader population.
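To make the idea of sampling concrete, here is a minimal sketch using only the Python standard library. The population, its size, and the sample size are hypothetical values chosen for illustration; they do not come from the course material.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values drawn from a normal distribution.
population = [random.gauss(50, 10) for _ in range(100_000)]

# A simple random sample that makes up only a small portion of the population.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean should land close to the population mean even though the sample is only half a percent of the data, which is what lets us generalize from one to the other.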
3. What is a confidence interval?
Simply put, a confidence interval is a range of values that we are fairly sure includes the true value of an unknown population parameter. It has an associated confidence level that represents how often intervals constructed this way will contain that value.
So, if we have a 95 percent confidence interval, this means that if we repeated the sampling 100 times, we would expect about 95 of the resulting intervals to contain the true parameter value of the population.
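One way to see what the confidence level means is to simulate it. The sketch below assumes a normal population with a known standard deviation, draws many samples, and counts how often the resulting interval contains the true mean. The function name `coverage_rate` and all parameter values are illustrative assumptions.

```python
import random
import statistics
from statistics import NormalDist


def coverage_rate(true_mean=50.0, sigma=10.0, n=30, confidence=0.95,
                  repetitions=2000, seed=1):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    # Two-sided critical value, roughly 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / repetitions


rate = coverage_rate()
print(f"share of intervals containing the true mean: {rate:.3f}")
```

The printed share should sit near 0.95. Note that it is the interval that varies from sample to sample, while the true mean stays fixed.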
This question seems simple enough, but it can be difficult to articulate things like this in simple terms. Make sure you're practicing even the seemingly straightforward questions like this, and then take it a step further. Why do we even use confidence intervals?
4. Calculating confidence intervals
Computing confidence intervals can be pretty straightforward once you get the hang of it, though the formula differs slightly depending on whether you're working with means or proportions.
For means, you take the sample mean, then add and subtract the appropriate z-score for your confidence level multiplied by the population standard deviation over the square root of the sample size. Note that this takes a slightly different form if you don't know the population variance: the sample standard deviation and a t-score are used instead.
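The formula for means can be sketched in a few lines. This example assumes the population standard deviation is known; the helper name `mean_confidence_interval`, the sample values, and the standard deviation of 0.3 are all hypothetical.

```python
from statistics import NormalDist, fmean


def mean_confidence_interval(sample, population_sd, confidence=0.95):
    """z-based interval for a mean when the population SD is known."""
    n = len(sample)
    # Two-sided critical value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * population_sd / n ** 0.5
    center = fmean(sample)
    return center - margin, center + margin


# Hypothetical measurements with an assumed known population SD of 0.3.
measurements = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.0]
lower, upper = mean_confidence_interval(measurements, population_sd=0.3)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With nine values averaging 5.0, the interval comes out to roughly 4.80 to 5.20.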
5. Calculating confidence intervals
For proportions, similarly, you take the sample proportion plus or minus the z-score times the square root of the following quantity: the sample proportion times one minus the sample proportion, divided by the sample size.
Both of these formulas are alike in the sense that they take a point estimate, the sample mean or the sample proportion, plus or minus some value that we compute. This value is referred to as the margin of error. Adding it to the estimate gives us the upper bound of our interval, whereas subtracting it from the estimate gives us our lower bound.
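The proportion formula and the margin of error can be sketched the same way. The helper name `proportion_confidence_interval` and the counts (120 successes in 400 trials) are made up for illustration, and the normal approximation used here is only one of several ways to build an interval for a proportion.

```python
from statistics import NormalDist


def proportion_confidence_interval(successes, trials, confidence=0.95):
    """Normal-approximation interval for a proportion, plus its margin of error."""
    p_hat = successes / trials
    # Two-sided critical value for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * (p_hat * (1 - p_hat) / trials) ** 0.5
    return p_hat - margin, p_hat + margin, margin


# Hypothetical data: 120 successes out of 400 trials.
lower, upper, margin = proportion_confidence_interval(120, 400)
print(f"margin of error: {margin:.4f}")
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```

Here the sample proportion is 0.30 and the margin of error is about 0.045, so the interval runs from roughly 0.255 to 0.345.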
6. Example: means
Now let's get into an example implementing confidence intervals in Python. There are a couple of different ways to do this, but we'll use the scipy stats package and its interval function, where we pass the confidence level, the degrees of freedom (the number of values minus one), the mean of our sample, and then the standard error computed with the sem function.
In this scenario, our sample of 10, 11, 12, and 13 gives us a 95 percent confidence interval of 9.45 to 13.55, meaning that intervals built this way should contain the true mean about 95 times out of 100.
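The transcript does not show the code on the slide, so the following is a reconstruction under assumptions: it uses the t distribution from `scipy.stats` with the sample size minus one as the degrees of freedom, which reproduces the interval quoted above. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

sample = np.array([10, 11, 12, 13])

confidence = 0.95
degrees_of_freedom = len(sample) - 1
sample_mean = np.mean(sample)
standard_error = stats.sem(sample)  # sample SD (ddof=1) over sqrt(n)

# The confidence level is passed positionally because its keyword name
# has changed between SciPy releases.
lower, upper = stats.t.interval(
    confidence, degrees_of_freedom, loc=sample_mean, scale=standard_error
)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

The t distribution is used rather than a z-score because the population variance is unknown and is estimated from only four values.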
7. Example: proportions
Similarly, we'll use another function for proportions. We can pass the proportion_confint function the number of successes, the number of trials, and the alpha value, which is 1 minus our confidence level. Here we can see a 95 percent confidence interval for 4 successes out of 10 trials.
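As a sketch of that call, assuming the function meant is `proportion_confint` from `statsmodels.stats.proportion` with its default normal-approximation method (the transcript does not show the import or the method used):

```python
from statsmodels.stats.proportion import proportion_confint

confidence = 0.95

# count = number of successes, nobs = number of trials,
# alpha = 1 minus the confidence level.
lower, upper = proportion_confint(count=4, nobs=10, alpha=1 - confidence)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With the default method this gives roughly 0.096 to 0.704. The interval is wide because 10 trials is a very small sample, and other `method` options may be a better fit for small samples.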
8. Summary
Let's summarize what we learned. We covered sampling, confidence intervals and how to calculate them, and then we walked through examples for means and proportions.
9. Let's prepare for the interview!
Let's practice implementing confidence intervals in the exercises!