
Central limit theorem

1. Central limit theorem

We've established a solid base with conditional probabilities. Now, let's get into the central limit theorem, or CLT: what it is, why it's important, and how to visualize it in Python.

2. What does it mean?

The central limit theorem says that if we draw many independent samples of the same size from the same population, the means of those samples will be approximately normally distributed, and the approximation improves as the sample size grows. Note that this doesn't require the underlying data to be normal; the population only needs a finite variance. A sample size of roughly 30 or more is a common rule of thumb for the approximation to hold reasonably well, although heavily skewed populations can need more.

3. Why does it matter?

The central limit theorem matters because it tells us the sampling distribution of the mean will be approximately normal, which is what lets us perform many common hypothesis tests. More concretely, we can assess how likely it is that a given sample mean came from a particular distribution and then, based on this, reject or fail to reject our hypothesis. This underpins much of the A/B testing you see in practice. For this reason, interviewers love this topic. Be sure to have a well-thought-out answer prepared.
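To make the hypothesis-testing idea concrete, here is a minimal sketch of a two-sided z-test on the mean of die rolls. It is an illustration under stated assumptions, not a method from the course: the function name `two_sided_p_value`, the observed mean of 4.0, the sample size of 100, and the 0.05 threshold are all made up for the example. It relies on the sample mean being approximately normal.

```python
import math
from statistics import NormalDist

# Known properties of a fair six-sided die.
POP_MEAN = 3.5
POP_STD = math.sqrt(35 / 12)  # variance of a fair die is (6**2 - 1) / 12

def two_sided_p_value(sample_mean, n):
    """Hypothetical helper: p-value for H0 'the die is fair'.

    Treats the sample mean as approximately normal with mean POP_MEAN
    and standard error POP_STD / sqrt(n).
    """
    standard_error = POP_STD / math.sqrt(n)
    z = (sample_mean - POP_MEAN) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Assumed observation: 100 rolls averaged 4.0.
p_value = two_sided_p_value(sample_mean=4.0, n=100)
reject_null = p_value < 0.05  # assumed significance level
print(f"p-value: {p_value:.4f}, reject H0: {reject_null}")
```

A small p-value means a mean this far from 3.5 would be unlikely for a fair die, so we would reject the hypothesis; otherwise we fail to reject it.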

4. Law of large numbers

It's also worth mentioning that this is different from the law of large numbers. The law of large numbers states that as the size of a sample increases, the sample mean converges to the population mean. We see this here with the purple, red, and gold distributions representing small, medium, and large samples, respectively. The law of large numbers is about where the sample mean ends up, while the central limit theorem is about the shape of the distribution of sample means. The two are easy to mix up in a high-stress interview setting.
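A quick way to see the law of large numbers is to compare the mean of a small, a medium, and a large sample of die rolls. This is a rough sketch: the three sample sizes, the seed, and the name `means_by_size` are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only so the run is repeatable

TRUE_MEAN = 3.5  # expected value of a fair six-sided die
sample_sizes = [10, 1_000, 100_000]  # assumed small, medium, large

# randint's upper bound is exclusive, so (1, 7) gives faces 1 through 6.
means_by_size = {n: np.random.randint(1, 7, n).mean() for n in sample_sizes}

for n, sample_mean in means_by_size.items():
    error = abs(sample_mean - TRUE_MEAN)
    print(f"n={n:>7}: sample mean = {sample_mean:.4f}, error = {error:.4f}")
```

The error for the largest sample should typically be the smallest, though any single small sample can land close to 3.5 by chance.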

5. Simulating CLT in Python

We can run a simulation in Python to get the following plot showing rolls of a fair six-sided die. To do this, we'll use the numpy randint function, passing the lower bound, the upper bound (which is exclusive), and the number of values we want to randomly generate, along with the numpy mean function. The sample means don't look like much at first, but they slowly become more and more normal around the true mean of 3.5, thanks to the central limit theorem at work. This simple matplotlib histogram shows only rolls 1 through 100, but you can imagine how this would continue if we upped the number of trials.
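The simulation described here can be sketched roughly as follows. The exact code from the slide isn't reproduced in this transcript, so the sample size of 30, the 1,000 repetitions, the seed, and the variable names are assumptions.

```python
import numpy as np

np.random.seed(42)  # assumed seed, only so the run is repeatable

ROLLS_PER_SAMPLE = 30  # assumed sample size
NUM_SAMPLES = 1_000    # assumed number of repeated samples

# Each sample is ROLLS_PER_SAMPLE rolls of a fair die; the upper bound
# of randint is exclusive, so 7 is needed to include the face 6.
sample_means = [
    np.mean(np.random.randint(1, 7, ROLLS_PER_SAMPLE))
    for _ in range(NUM_SAMPLES)
]

print(f"mean of sample means: {np.mean(sample_means):.3f}")
print(f"std of sample means:  {np.std(sample_means):.3f}")
```

Passing `sample_means` to matplotlib's `plt.hist` would draw the histogram; the bars should cluster in a bell shape around 3.5, with a spread close to the die's standard deviation divided by the square root of the sample size.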

6. List comprehension

Before we wrap up, let's cover list comprehension. List comprehension is a pretty cool Python trick that comes in handy for setting up these numpy simulations and for certain coding interview questions. Here you see a snippet of code that takes in our list and squares each value using a for loop. List comprehension tightens this up by letting you write the loop in a single line, giving us the same answer.
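Since the slide's snippet isn't included in this transcript, here is a small stand-in showing both versions side by side. The input list and variable names are made up for illustration.

```python
values = [1, 2, 3, 4, 5]  # hypothetical input list

# Version 1: an explicit for loop that appends each squared value.
squared_loop = []
for value in values:
    squared_loop.append(value ** 2)

# Version 2: the same logic as a one-line list comprehension.
squared_comprehension = [value ** 2 for value in values]

print(squared_loop)           # [1, 4, 9, 16, 25]
print(squared_comprehension)  # [1, 4, 9, 16, 25]
```

Both versions build a new list and leave `values` unchanged; the comprehension is simply more compact.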

7. Summary

Wrapping things up, let's summarize what we learned. We talked about the central limit theorem, what it is, and why it matters. We touched on the law of large numbers, looked at a simulation of the CLT in Python, and finally went over list comprehension. Remember, interviewers love the central limit theorem, and it's really fundamental to data science, so it's worth getting comfortable with the topic.

8. Let's prepare for the interview!

But enough on CLT for now, let's get to some coding exercises!