Adding random variables

1. Adding random variables

The most important result in probability and statistics is the central limit theorem. Let's take a look at what happens when you add random variables.

2. The central limit theorem (CLT)

The CLT states that the distribution of the sum of random variables, once suitably standardized, tends to a normal distribution as the number of variables grows to infinity. In its classical form, the theorem holds under certain conditions: the variables must have the same distribution with finite variance, and the variables must be independent. You can start adding binomial, geometric, or even Poisson random variables, and as you add more, the sum looks more and more like a normal distribution. Recall that random variables are independent when the outcome of one variable does not affect the outcome of the others. Let's see an example.
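As a quick numerical check of this statement, here is a minimal sketch that sums independent Poisson variables and measures how skewed the sums are. The rate of 2, the seed, the number of repetitions, and the names `n_repeats` and `skews` are assumptions chosen for illustration; a normal distribution has skewness 0, so the skewness should shrink toward 0 as more variables are added.

```python
import numpy as np
from scipy.stats import skew

# Seed and repetition count are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
n_repeats = 10_000

skews = {}
for n in (1, 10, 100):
    # Each row holds n independent Poisson(2) draws; summing a row gives one sum.
    sums = rng.poisson(lam=2, size=(n_repeats, n)).sum(axis=1)
    # Sample skewness of the sums: values near 0 indicate a symmetric, bell-like shape.
    skews[n] = skew(sums)
    print(f"n={n:>3}: skewness of the sum = {skews[n]:.3f}")
```

The skewness is only one rough symptom of normality, but it is enough to see the trend the theorem describes.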

3. Poisson sample generation

In an example we saw previously about a busy highway with two accidents per day on average, we modeled the number of accidents per day with a Poisson random variable. Now imagine we have the data from 1,000 days. In the following animation you can see on the left the values of our population, and on the right you can see the histogram of the population values. This is our data.

4. Selection from population

Now we are going to take 10 values from our population many times, so we can calculate the sample mean of those values each time. Notice the red dots. Recall that when calculating the sample mean we are adding the values and dividing by the sample size, and the central limit theorem applies to the sum of independent random variables that are identically distributed.
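To make the link between the sample mean and a sum explicit, here is a short worked version for this example. It assumes the daily counts $X_1, \dots, X_n$ are independent Poisson variables with mean 2 (so the variance is also 2) and a sample size of $n = 10$; the symbols are introduced here only for illustration.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\mathbb{E}[\bar{X}] = \mu = 2,
\qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{2}{10} = 0.2
```

Under these assumptions, the CLT suggests the sample means are roughly normal, centered at 2 with a standard deviation of about 0.45.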

5. Selection from population (Cont.)

Notice the histogram of the population -- it's skewed! Now we are going to repeat this process 350 times and see the outcome.

6. Poisson sample mean plot

Take a look at these animations. At the top we have the population. We're highlighting in red the 10 randomly selected values used to calculate the sample means, and plotting those values. At the bottom left we're plotting the sample means, and at the bottom right is a histogram of the sample means. Notice that as we calculate more sample means from our population the histogram is centered at 2, which is the mean of our population, and the histogram takes on a bell shape. That is the magic of the central limit theorem. Now let's code this important result.

7. Poisson population plot

First we import poisson and describe from scipy dot stats. Then, from matplotlib we import pyplot as plt, and we import numpy as np. We generate our population with poisson dot rvs with mu equals 2, size equals 1000, and the random_state seed set to reproduce our results. Now we can plot a histogram of our population.
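The steps just described might look like the following sketch. The seed value, the axis labels, and the one-bin-per-integer choice are assumptions, since the transcript does not state them.

```python
from scipy.stats import poisson, describe
import matplotlib.pyplot as plt
import numpy as np

# 1,000 days of accident counts, averaging 2 per day.
# The seed value 20 is an assumption; any fixed integer makes the run repeatable.
population = poisson.rvs(mu=2, size=1000, random_state=20)

# Summary statistics of the population (count, min/max, mean, variance, ...).
print(describe(population))

# The counts are integers, so one bin per integer value is used here (a choice
# made for this sketch, not something stated in the transcript).
bins = np.arange(population.max() + 2) - 0.5
plt.hist(population, bins=bins)
plt.xlabel("Accidents per day")
plt.ylabel("Frequency")
plt.title("Population histogram")
plt.show()
```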

8. Poisson population plot (Cont.)

This is the plot. It's a skewed histogram of our Poisson data. Next, let's plot the sample means.

9. Sample means plot

We first fix our random seed to make the results reproducible. We define an empty list called sample_means to store the sample mean values. Then we write a for statement to loop an arbitrarily chosen large number of times, like 350. In each iteration, we select 10 values from our population using np dot random dot choice, and then we append the sample mean of the 10 values to the sample_means list.

10. Sample means plot (Cont.)

Outside the for statement, we add labels and a title to the plot. Finally, we plot and show the histogram. We get a plot centered at 2, which is the mean of the population, with a bell shape as we expected.
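Put together, the loop and the plot could be sketched as follows. The seed values, the label text, and the use of np.mean to compute each sample mean are assumptions; the population is regenerated here so the block runs on its own.

```python
from scipy.stats import poisson
import matplotlib.pyplot as plt
import numpy as np

# Population as before; the seed value is an assumption.
population = poisson.rvs(mu=2, size=1000, random_state=20)

# Fix the NumPy seed so the random selections are reproducible (value assumed).
np.random.seed(42)

sample_means = []
for _ in range(350):
    # Pick 10 values from the population (np.random.choice samples
    # with replacement by default).
    sample = np.random.choice(population, 10)
    # Store the mean of this sample.
    sample_means.append(np.mean(sample))

# Labels and title are illustrative wording, not taken from the transcript.
plt.xlabel("Sample mean values")
plt.ylabel("Frequency")
plt.title("Sample means histogram")
plt.hist(sample_means)
plt.show()
```

With only 350 sample means the histogram is still a little rough; increasing the number of iterations should make the bell shape smoother.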

11. Let's add random variables

We've finished with the most important results in probability and statistics. After exercising a bit with the central limit theorem, we will work on two applications of probability in data science, linear regression and logistic regression. Let's add random variables!