Categorical explanatory variables

1. Categorical explanatory variables

So far, we've looked at running a linear regression with a numeric explanatory variable. Now let's see what happens with a categorical explanatory variable.

2. Fish dataset

Let's take a look at some data on the masses of fish sold at a fish market. Each row of data contains the species of a fish and its mass. The mass will be the response variable.
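As a sketch of that structure, here is a hypothetical miniature version of such a dataset. The column names species and mass_g, and all the values, are invented for illustration; the real fish-market data has many more rows.

```r
# Hypothetical stand-in for the fish-market data.
# Column names and values are assumptions, not the real dataset.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch", "Pike", "Roach"),
  mass_g  = c(600, 650, 350, 400, 700, 150)
)

str(fish)
```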

3. Visualizing 1 numeric and 1 categorical variable

Scatter plots aren't ideal for visualizing this data because species is categorical. Instead, we can draw a histogram for each species. Because the dataset is fairly small, I set the bins argument of geom_histogram() to just nine. To give each species its own panel, I used facet_wrap(), which takes the name of the variable to split on, wrapped in the vars() function.
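The plotting call might look like this. The data frame fish and its columns species and mass_g are assumptions standing in for the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch", "Pike", "Roach"),
  mass_g  = c(600, 650, 350, 400, 700, 150)
)

# One histogram panel per species; just nine bins since the data is small.
ggplot(fish, aes(mass_g)) +
  geom_histogram(bins = 9) +
  facet_wrap(vars(species))
```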

4. Summary statistics: mean mass by species

Let's calculate some summary statistics. First, we group by species, then summarize to calculate each group's mean mass. You can see that the mean mass of a bream is six hundred and eighteen grams, the mean mass of a perch is three hundred and eighty-two grams, and so on.
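A sketch of that calculation with dplyr, again assuming a hypothetical fish data frame (so the means printed here won't match the real dataset's):

```r
library(dplyr)

# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

fish %>%
  group_by(species) %>%
  summarize(mean_mass_g = mean(mass_g))
```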

5. Linear regression

Let's run a linear regression using mass as the response variable and species as the explanatory variable. The syntax is the same: you call lm(), passing a formula with the response variable on the left and the explanatory variable on the right, and setting the data argument to the data frame. This time we get four coefficients: an intercept, plus one each for three of the four fish species. The coefficient for bream is missing, but the intercept looks familiar: it is the mean mass of the bream that you just calculated. You might wonder what the other coefficients are, and why perch has a negative coefficient when fish masses can't be negative. The coefficient for each category is calculated relative to the intercept. If you subtract two hundred and thirty-five point six from six hundred and seventeen point eight, you get three hundred and eighty-two point two, which is the mean mass of a perch. This way of displaying results is useful for models with multiple explanatory variables, but for simple linear regression it's just confusing. Fortunately, we can fix it.
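Here is the same call sketched on a tiny hypothetical fish data frame. With this invented data, the bream mean is 625 and the perch mean is 375, so the perch coefficient comes out as the difference between the two.

```r
# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

lm(mass_g ~ species, data = fish)
# (Intercept) is the mean of the first factor level (Bream): 625.
# speciesPerch is the Perch mean minus the Bream mean: 375 - 625 = -250.
```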

6. No intercept

By changing the formula slightly to append "plus zero", we specify that all the coefficients should be given relative to zero. Equivalently, it means we are fitting a linear regression without an intercept term. Now these coefficients make more sense. They are all just the mean masses for each species. This is a reassuringly boring result. When you only have a single, categorical explanatory variable, the linear regression coefficients are the means of each category.
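Appending plus zero to the same hypothetical example makes the effect visible: each coefficient is now simply that species' mean mass.

```r
# Hypothetical stand-in for the fish-market data.
fish <- data.frame(
  species = c("Bream", "Bream", "Perch", "Perch"),
  mass_g  = c(600, 650, 350, 400)
)

# "+ 0" drops the intercept, so each coefficient is a per-species mean.
lm(mass_g ~ species + 0, data = fish)
# speciesBream = 625, speciesPerch = 375 -- the group means themselves.
```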

7. Let's practice!

Time for you to try it.