1. Poisson and quasipoisson regression to predict counts
In this lesson, you will learn about regression to predict count data.
2. Predicting Counts
Predicting counts is a non-linear problem, because counts are restricted to being non-negative and integral.
The counterpart to linear regression for count data is Poisson or quasipoisson regression.
3. Poisson/Quasipoisson Regression
Poisson regression is also a generalized linear model. It assumes that the inputs are additive and linear with respect to the log of the outcome. In R, you will again use the glm function, with family either poisson or quasipoisson.
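To make the log-link idea concrete, here is a minimal sketch on simulated data. The data frame, variable names, and coefficients are invented for illustration and are not part of the lesson's dataset.

```r
# Simulated data: variable names and coefficients are hypothetical.
set.seed(42)
n <- 200
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))  # log of the mean is linear in x
df <- data.frame(x = x, y = y)

# Same formula interface as other glm() calls; only the family changes.
pois_model  <- glm(y ~ x, data = df, family = poisson)
quasi_model <- glm(y ~ x, data = df, family = quasipoisson)

# Both families produce the same coefficient estimates; they differ in
# how the standard errors are scaled.
coef(pois_model)
coef(quasi_model)

# The linear predictor lives on the log scale, so exponentiating it
# gives the expected count.
eta <- predict(pois_model, newdata = df, type = "link")
mu  <- predict(pois_model, newdata = df, type = "response")
all.equal(exp(eta), mu)
```

Because the model is linear in the log of the outcome, each coefficient acts multiplicatively on the expected count once it is exponentiated.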
4. Poisson/Quasipoisson Regression
For Poisson regression, the outcome is an integer that represents a count or a rate, like the number of traffic tickets a driver gets in a year, or the number of visits to a website per day. The model returns an expected rate that is not necessarily an integer, for instance the expected number of visits to a website per day.
5. Poisson vs. Quasipoisson
Poisson regression assumes that the process producing the count has a Poisson distribution, in which the mean equals the variance. For many real-life processes, the variance will be quite different from the mean. In that case, you should use quasipoisson regression.
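A quick way to see the difference is to compare the mean and variance of the outcome directly. The sketch below uses two simulated count vectors (both hypothetical): one drawn from a Poisson distribution and one from a negative binomial, which is used here only as a convenient source of overdispersed counts.

```r
# Simulated counts; both vectors are hypothetical stand-ins for real outcomes.
set.seed(1)
equi  <- rpois(5000, lambda = 10)          # Poisson: variance close to the mean
overd <- rnbinom(5000, mu = 10, size = 1)  # overdispersed: variance = mu + mu^2 / size

# Compare the mean and the variance of each outcome.
c(mean = mean(equi),  var = var(equi))
c(mean = mean(overd), var = var(overd))
```

When the two numbers are close, the poisson family is a reasonable choice; when the variance is far from the mean, as in the second vector, quasipoisson is the safer one.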
Poisson and quasipoisson regression work better on larger datasets.
If the counts that you want to predict are much larger than zero, ordinary linear regression will often work fine as well.
Let's see an example.
6. Example: Predicting Bike Rentals
Here we have hourly data from a bike sharing system in Washington, D.C., detailing the number of bikes rented during the first 2 weeks of January. We want a model to predict the hourly bike rental counts as a function of the time of day, the type of day (workday, weekend, or holiday), and details about the weather.
7. Fit the model
We’ll use January data for training. Because the variance of the bike rentals is much larger than the mean, we will use quasipoisson regression.
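The fitting step might look roughly like the sketch below. The lesson's data is not reproduced here, so the code simulates a stand-in: the data frame bikes_jan, its columns (hr, workingday, temp, cnt), and all effect sizes are assumptions made for illustration.

```r
# Hypothetical stand-in for the bike data: column names and effect sizes
# are invented and do not come from the lesson's dataset.
set.seed(2024)
n <- 14 * 24                                  # two weeks of hourly rows
hour <- rep(0:23, times = 14)
bikes_jan <- data.frame(
  hr         = factor(hour),
  workingday = rep(c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
                   each = 24, times = 2),
  temp       = runif(n, 0.1, 0.5)
)
mu <- exp(2 + 1.5 * sin(pi * hour / 24) +
          0.3 * bikes_jan$workingday + 1.5 * bikes_jan$temp)
bikes_jan$cnt <- rnbinom(n, mu = mu, size = 2)  # overdispersed counts

# A variance far above the mean points to quasipoisson rather than poisson.
mean(bikes_jan$cnt)
var(bikes_jan$cnt)

bike_model <- glm(cnt ~ hr + workingday + temp,
                  data = bikes_jan, family = quasipoisson)

# Estimated dispersion; a value near 1 would match the Poisson assumption.
summary(bike_model)$dispersion
```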
8. Check model fit
As with logistic regression, we can use pseudo-R-squared (one minus the ratio of the deviance to the null deviance) to check the goodness of fit. We can get the deviance and null deviance of the model using glance. This model explains about 76% of the deviance, which is not too bad.
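As a rough sketch of that calculation, the code below fits a small quasipoisson model on simulated data (hypothetical names and coefficients) and computes pseudo-R-squared from the fitted object. To stay self-contained it reads the two deviances straight from the model rather than calling glance; broom's glance reports the same quantities in its deviance and null.deviance columns.

```r
# Simulated data; names and coefficients are hypothetical.
set.seed(7)
df <- data.frame(x = runif(300, 0, 2))
df$y <- rnbinom(300, mu = exp(1 + 1.2 * df$x), size = 3)

model <- glm(y ~ x, data = df, family = quasipoisson)

# Pseudo-R-squared: the fraction of the null deviance that the model explains.
pseudo_r2 <- 1 - model$deviance / model$null.deviance
pseudo_r2
```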
9. Predicting from the model
The predict function takes the model and the data to predict on. As with logistic regression, be sure to use type = "response" to get the predicted rates. Here, we apply the model to data from February, and also plot the predicted hourly rates versus the actual counts.
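The effect of type = "response" can be seen in a small sketch. The training and holdout frames below are simulated stand-ins for the January and February data; their names and coefficients are assumptions.

```r
# Simulated training and holdout data; all names are hypothetical.
set.seed(11)
train <- data.frame(x = runif(300, 0, 2))
train$y <- rnbinom(300, mu = exp(1 + 1.2 * train$x), size = 3)
holdout <- data.frame(x = runif(100, 0, 2))
holdout$y <- rnbinom(100, mu = exp(1 + 1.2 * holdout$x), size = 3)

model <- glm(y ~ x, data = train, family = quasipoisson)

# type = "response" returns expected counts on the original scale.
holdout$pred <- predict(model, newdata = holdout, type = "response")

# The default (type = "link") returns the log of the expected count instead.
link_pred <- predict(model, newdata = holdout)

head(holdout)
```

The predictions are expected rates, so they are positive but generally not whole numbers. Forgetting type = "response" would leave them on the log scale.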
10. Evaluate the model
We can calculate the root mean squared error of the model for February. On average, the prediction is off by about 69 rentals each hour, which is about half the standard deviation of the hourly rental counts.
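The comparison of RMSE with the standard deviation can be sketched as follows. The two vectors are made-up numbers for illustration only, not the lesson's February results.

```r
# Made-up actual counts and predicted rates, for illustration only.
actual <- c(12, 40, 75, 30, 8)
pred   <- c(15.2, 35.5, 60.1, 33.0, 10.4)

# Root mean squared error: the typical size of a prediction error.
rmse <- sqrt(mean((pred - actual)^2))

# Standard deviation of the outcome: the error of always predicting the mean.
sd_actual <- sd(actual)

rmse
sd_actual
rmse / sd_actual   # below 1 means the model beats the mean-only baseline
```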
11. Compare Predictions and Actual Outcomes
We can also compare the predictions to actual outcomes as a function of time.
12. Let's practice!
Let’s now practice fitting and predicting from count models.