Count data and Poisson distribution

1. Count data and Poisson distribution

In the previous chapter we talked about modeling binary data, i.e. the probability that an event occurs. In this chapter we are still concerned with the occurrence of an event, but this time we don't want to measure the occurrence as a binary value. Instead, we want to count how often the event occurs.

2. Count data

Count data record the number of occurrences of an event in a specified unit of time, distance, area or volume. Examples of such measurements are the number of goals in a soccer match, the number of earthquakes in a certain region, the number of satellite crabs in a nest, and so on.

3. Poisson random variable

Such measurements constitute a Poisson random variable if the events occur independently and randomly, meaning the probability that an event occurs in a given unit of time does not change through time. Then we say that the random variable follows a Poisson distribution with parameter lambda, which is both the mean and the variance. Counts are always non-negative and discrete, not continuous: they range from zero to infinity. Hence count data have a lower bound at zero but no upper bound. For comparison, the normal distribution has no bounds. In addition, counts can have many zero observations and be right-skewed, which adds to the reasons why we wouldn't use the linear model for count data.
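As a quick numerical check of the claim that lambda is both the mean and the variance, here is a minimal sketch using only the Python standard library. The rate of 3 and the helper name poisson_pmf are illustrative assumptions, not something taken from the course.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson random variable with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 3.0  # illustrative rate (assumption, not from the course)

# The support is 0, 1, 2, ... without an upper bound; we truncate it at 60
# because the probability mass beyond that point is negligible for lam = 3.
support = range(61)
probs = [poisson_pmf(k, lam) for k in support]

total = sum(probs)
mean = sum(k * p for k, p in zip(support, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(support, probs))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f}, variance: {variance:.6f}")
```

Both the computed mean and the computed variance should agree with lam up to rounding error.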

4. Understanding the parameter of the Poisson distribution

Let's see how the Poisson distribution changes as we vary the parameter lambda. The following figures show lambda equal to 1, 5 and 10 respectively. Notice that when lambda is 1 the distribution is highly skewed, but as we increase lambda the distribution spreads and becomes more symmetric.
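To put numbers on this behaviour, the sketch below computes the skewness of the Poisson distribution for the same three values of lambda and prints a rough text histogram of each. It uses only the standard library. The helper names and the truncation of the support at 200 are assumptions made for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k), computed on the log scale to stay stable for large k."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def skewness(lam: float, k_max: int = 200) -> float:
    """Third standardized moment, summed over a truncated support."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    mean = sum(k * p for k, p in zip(ks, probs))
    var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
    third = sum((k - mean) ** 3 * p for k, p in zip(ks, probs))
    return third / var**1.5


lambdas = (1, 5, 10)
skews = {lam: skewness(lam) for lam in lambdas}

for lam in lambdas:
    print(f"lambda = {lam}, skewness = {skews[lam]:.3f}")
    for k in range(21):
        # One '#' per percentage point of probability, rounded.
        print(f"{k:2d} " + "#" * round(poisson_pmf(k, lam) * 100))
```

The skewness of a Poisson distribution equals one over the square root of lambda, so it shrinks as lambda grows, which matches the increasingly symmetric shape.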

5. Visualizing the response

In Python you can plot your response data to visually check the distribution shape using the seaborn library and its distplot function, as shown in the previous slide.

6. Poisson regression

Now we have a proper basis to define the Poisson regression model. Starting with the response y, which is a count, we assume the observations are Poisson random variables: y ~ Poisson(lambda). Note that the tilde means "distributed as". We want to model the expected value of y, i.e. lambda. Recall that y is constrained to be non-negative, so lambda must be positive. To remove this constraint we take the logarithm: the log of lambda takes values from minus infinity to infinity. This defines the Poisson regression model, in which the log of lambda is a linear combination of the explanatory variables and is therefore linear in the parameters.
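Written out, the model described above can be sketched as follows. The index i for observations, the number of explanatory variables p and the coefficient symbols beta are notation assumed here, since the slide itself is not reproduced in the transcript.

```latex
\begin{align*}
y_i &\sim \mathrm{Poisson}(\lambda_i), \qquad E(y_i) = \lambda_i > 0 \\
\log(\lambda_i) &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \\
\lambda_i &= \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})
\end{align*}
```

The last line shows why the log link works: whatever value the linear combination takes, exponentiating it always gives a positive lambda.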

7. Explanatory variables

The explanatory variables x can be continuous, categorical, or a combination of both. If all the explanatory variables are categorical, the model is referred to in the literature as the log-linear model.

8. GLM with Poisson in Python

In Python we can fit a Poisson GLM using the already familiar glm function from the statsmodels library. For count data we set the family argument to the Poisson distribution, whose default link function is the logarithm. The formula and data arguments are the same as in logistic and linear regression.

9. Let's practice!

Now let's practice building and assessing Poisson regression models.
