Categorical explanatory variables

1. Categorical explanatory variables

So far, we've looked at running a linear regression with a numeric explanatory variable. Now let's see what happens with a categorical explanatory variable.

2. Fish dataset

Let's take a look at some data on the masses of fish sold at a fish market. Each row of data contains the species of a fish and its mass. The mass will be the response variable.
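As a sketch of that structure, here is a hypothetical miniature version of such a dataset. The column names species and mass_g, and all the values, are invented for illustration; the real fish-market data has many more rows.

```r
# Hypothetical stand-in for the fish-market data.
# Column names and values are assumptions, not the real dataset.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch", "Pike", "Roach"),
  mass_g  = c(600, 650, 350, 400, 700, 150)
)

str(fish)
```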

3. Visualizing 1 numeric and 1 categorical variable

Scatter plots aren't ideal for visualizing this data because species is categorical. Instead, we can draw a histogram for each species. Because the dataset is fairly small, I set the bins argument of geom_histogram() to just nine. To give each species its own panel, I used facet_wrap(), which takes the name of the variable to split on, wrapped in the vars() function.
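The plotting call might look like this. The data frame fish and its columns species and mass_g are assumptions standing in for the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch", "Pike", "Roach"),
  mass_g  = c(600, 650, 350, 400, 700, 150)
)

# One histogram panel per species; just nine bins since the data is small.
ggplot(fish, aes(mass_g)) +
  geom_histogram(bins = 9) +
  facet_wrap(vars(species))
```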

4. Summary statistics: mean mass by species

Let's calculate some summary statistics. First, we group by species, then summarize to calculate each group's mean mass. You can see that the mean mass of a bream is six hundred and eighteen grams, the mean mass of a perch is three hundred and eighty-two grams, and so on.
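A sketch of that calculation with dplyr, again assuming a hypothetical fish data frame (so the means printed here won't match the real dataset's):

```r
library(dplyr)

# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

fish %>%
  group_by(species) %>%
  summarize(mean_mass_g = mean(mass_g))
```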

5. Linear regression

Let's run a linear regression using mass as the response variable and species as the explanatory variable. The syntax is the same: you call lm(), passing a formula with the response variable on the left and the explanatory variable on the right, and setting the data argument to the data frame. This time we get four coefficients: an intercept, plus one each for three of the four fish species. The coefficient for bream is missing, but the intercept looks familiar: it is the mean mass of the bream that you just calculated. You might wonder what the other coefficients are, and why perch has a negative coefficient when fish masses can't be negative. The coefficient for each category is calculated relative to the intercept. If you subtract two hundred and thirty-five point six from six hundred and seventeen point eight, you get three hundred and eighty-two point two, which is the mean mass of a perch. This way of displaying results is useful for models with multiple explanatory variables, but for simple linear regression it's just confusing. Fortunately, we can fix it.
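Here is the same call sketched on a tiny hypothetical fish data frame. With this invented data, the bream mean is 625 and the perch mean is 375, so the perch coefficient comes out as the difference between the two.

```r
# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

lm(mass_g ~ species, data = fish)
# (Intercept) is the mean of the first factor level (Bream): 625.
# speciesPerch is the Perch mean minus the Bream mean: 375 - 625 = -250.
```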

6. No intercept

By changing the formula slightly to append "plus zero", we specify that all the coefficients should be given relative to zero. Equivalently, it means we are fitting a linear regression without an intercept term. Now these coefficients make more sense. They are all just the mean masses for each species. This is a reassuringly boring result. When you only have a single, categorical explanatory variable, the linear regression coefficients are the means of each category.
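Appending plus zero to the same hypothetical example makes the effect visible: each coefficient is now simply that species' mean mass.

```r
# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

# "+ 0" drops the intercept, so each coefficient is a per-species mean.
lm(mass_g ~ species + 0, data = fish)
# speciesBream = 625, speciesPerch = 375 -- the group means themselves.
```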

7. Let's practice!

Time for you to try it.