1. Bayesian regression with a categorical predictor
In this final chapter, we'll generalize our foundational Bayesian models for application in broader contexts.
2. Chapter 4 goals
Specifically, we'll learn how to incorporate categorical predictors into Bayesian models
and to engineer multivariate Bayesian regression models.
Finally, we'll extend our methodology for Normal regression models to the generalized linear model setting, with a focus on Poisson regression.
3. Rail-trail volume
Throughout, we'll explore the daily volume on a Massachusetts rail-trail, a rail track that's been converted into a path for cycling, jogging, and walking.
4. Modeling volume
Let $Y_i$ denote the trail volume, or number of trail users, on day $i$.
We'll assume for now that volume varies Normally from day to day around some average $m_i$ with standard deviation $s$.
5. Modeling volume by weekday
Some of the variability in trail volume might be explained by the day of the week. Let $X_i$ indicate weekday status: 1 for weekdays and 0 for weekends. Thus $X_i$ is a categorical variable with 2 levels.
6. Modeling volume by weekday
We expect that the trend in volume varies for these 2 levels. For example, volume might tend to be higher on weekends (level 0 in red) than on weekdays (level 1 in black).
7. Modeling volume by weekday
We can represent the dependence of the trend in volume $m_i$ on weekday status $X_i$ by $a + b X_i$, just as for our previous regression models. Yet since $X_i$ is categorical, the interpretation of regression parameters $a$ and $b$ differs.
8. Modeling volume by weekday
First, for weekends, $X_i$ is 0 and $m_i$ reduces to $a$. Thus $a$ represents the typical weekend volume.
9. Modeling volume by weekday
For weekdays, $X_i$ is 1 and $m_i$ simplifies to $a + b$. Thus this sum represents the typical weekday volume.
It follows that $b$ is the contrast between typical weekday and weekend volume, that is, the typical weekday volume minus the typical weekend volume.
Finally, $s$ is the residual standard deviation: the typical deviation of a day's volume from the trend $m_i$.
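Written out case by case, the interpretation looks like this. This only restates the narration above; the numbers in the closing example are hypothetical.

$$
m_i = a + b X_i =
\begin{cases}
a & \text{if } X_i = 0 \text{ (weekend)} \\
a + b & \text{if } X_i = 1 \text{ (weekday)}
\end{cases}
\qquad\Longrightarrow\qquad
b = (a + b) - a
$$

For instance, if $a$ were 400 and $b$ were $-50$, the typical weekday volume would be $400 - 50 = 350$ users.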
10. Priors for $a$ & $b$
The rail-trail managers suggest some prior models. They know that $a$, the typical *weekend* volume, is most likely around 400 users per day, but possibly as low as 100 or as high as 700 users.
They lack certainty about $b$, how weekday volume compares to weekend volume. It could be more, it could be less. Thus the prior for $b$ ranges from roughly 800 fewer to 800 more users per day.
11. Prior for $s$
Finally, the standard deviation in volume from day to day (whether on weekdays or weekends) is equally likely to be anywhere between 0 and 200 users.
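To get a feel for what these priors imply, we can simulate from them. The sketch below uses plain Python rather than RJAGS, and it rests on assumptions: the narration gives only rough ranges, so we assume Normal priors for $a$ and $b$ and read each stated range as roughly the mean plus or minus 3 standard deviations. Only the Uniform bounds for $s$ come directly from the text. The names `draw_prior`, `A_SD`, and `B_SD` are illustrative.

```python
import random

# Prior settings. Only the means and the Uniform bounds come from the text;
# the standard deviations are ASSUMPTIONS (stated range read as mean +/- 3 SD).
A_MEAN, A_SD = 400.0, 100.0      # a: around 400, roughly 100 to 700
B_MEAN, B_SD = 0.0, 800.0 / 3.0  # b: roughly -800 to +800
S_MIN, S_MAX = 0.0, 200.0        # s: equally likely anywhere in 0 to 200


def draw_prior(rng):
    """Draw one (a, b, s) triple from the assumed prior models."""
    a = rng.gauss(A_MEAN, A_SD)
    b = rng.gauss(B_MEAN, B_SD)
    s = rng.uniform(S_MIN, S_MAX)
    return a, b, s


rng = random.Random(1)  # fixed seed so the sketch is reproducible
draws = [draw_prior(rng) for _ in range(10000)]

mean_a = sum(d[0] for d in draws) / len(draws)
mean_b = sum(d[1] for d in draws) / len(draws)
# Share of simulated weekend trends that land in the managers' 100-700 range
share_a_in_range = sum(100 <= d[0] <= 700 for d in draws) / len(draws)

print(round(mean_a), round(mean_b), round(share_a_in_range, 3))
```

If the assumed standard deviations are reasonable, nearly all simulated values of `a` should fall inside the managers' stated range, and the simulated values of `b` should be split about evenly between negative and positive.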
12. Bayesian model of volume by weekday status
In summary, we have the following Bayesian model of rail-trail volume $Y_i$ by categorical weekday status $X_i$.
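The model can be written out as below. The prior centers and the Uniform bounds follow the narration. The Normal form for the prior on $a$ is an assumption, and the prior standard deviations $\sigma_a$ and $\sigma_b$ are placeholders because the narration gives only rough ranges; reading "100 to 700" as the mean plus or minus 3 standard deviations would suggest $\sigma_a \approx 100$.

$$
\begin{aligned}
Y_i &\sim N(m_i,\, s^2), \qquad m_i = a + b X_i \\
a &\sim N(400,\, \sigma_a^2) \\
b &\sim N(0,\, \sigma_b^2) \\
s &\sim \text{Unif}(0,\, 200)
\end{aligned}
$$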
13. DEFINE the Bayesian model in RJAGS
To define this model in RJAGS, let's start with the familiar pieces.
14. DEFINE the Bayesian model in RJAGS
We can define the `Y[i]`, `a`, and `s` models as before. Because `X` is categorical, we need a new strategy in two places: the definition of the trend `m[i]` and the prior for `b`. There are several strategies that we could use here. We'll use one that scales up easily to categorical variables with more than 2 levels.
15. DEFINE the Bayesian model in RJAGS
First, in defining trend `m[i]`, notice the use of square brackets to specify the dependence of parameter `b` on `X[i]`.
Here `X[i]` has two levels, weekend and weekday, labelled "1" and "2" in RJAGS.
Thus `b` has 2 corresponding levels, `b[1]` and `b[2]`.
Putting this together, the weekend trend is represented by `a + b[1]`.
16. DEFINE the Bayesian model in RJAGS
To ensure that this matches our Bayesian model, in which `a` alone represents the weekend trend, `b[1]` is set to 0. This level of `b` simply acts as a reference or baseline.
17. DEFINE the Bayesian model in RJAGS
In contrast, the weekday trend is represented in RJAGS by the sum `a + b[2]`, where `b[2]` corresponds to the original $b$ parameter and keeps its Normal prior.
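The indexing idea can be mimicked outside of RJAGS. The following is a minimal Python sketch, not RJAGS code: each level of `X` selects one element of `b`, with the reference level pinned to 0. The function name `trend` and the values of `a` and `b2` are hypothetical, chosen only for illustration.

```python
# Level codes as described in the text: 1 = weekend (reference), 2 = weekday.
WEEKEND, WEEKDAY = 1, 2


def trend(a, b2, level):
    """Return m = a + b[level], where b[1] is fixed at 0 as the baseline."""
    b = {WEEKEND: 0.0, WEEKDAY: b2}
    return a + b[level]


a, b2 = 400.0, -50.0  # hypothetical parameter values
x = [WEEKDAY, WEEKDAY, WEEKEND, WEEKEND, WEEKDAY]
m = [trend(a, b2, level) for level in x]
print(m)  # [350.0, 350.0, 400.0, 400.0, 350.0]
```

A predictor with more than 2 levels would only need more entries in `b`, each with its own prior, which is why this strategy scales.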
18. Let's practice!
Let's pause and play around with categorical variables in RJAGS.