
Formulas in R

1. Formulas in R

Multiple regression involves using multiple predictor variables. Formulas are how R relates these predictor variables to the response variable. However, formulas can be tricky to get a grasp on. The purpose of this section is to introduce you to formulas for multiple regression.

2. Why care about formulas for multiple logistic regression?

Why do we care about formulas? Formulas form the backbone of regression in R. They can be tricky to understand, but once you've got a handle on them, your regression toolbox will be more powerful. The model.matrix() function powers formulas in R, and understanding this function will help you understand formulas. Before diving into model.matrix(), you'll see the relationship between slopes and intercepts.

3. Slopes

Slope coefficients are estimated for continuous variables, for example, a variable such as height. A slope coefficient predicts the linear change in the response for each one-unit change in the predictor. In R, the formula for a model with a slope also includes a global intercept by default. Estimating multiple slopes simply estimates a linear coefficient for each predictor.
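As a rough illustration of slopes, here is a minimal sketch. The data frame and its columns (height, age, weight) are made up for this example and are not from the course data.

```r
# Hypothetical data: height and age are continuous predictors of weight
df <- data.frame(
  height = c(150, 160, 170, 180, 190),
  age    = c(20, 35, 30, 45, 40),
  weight = c(52, 60, 67, 80, 85)
)

# One slope: R adds the global intercept automatically
fit_one <- lm(weight ~ height, data = df)
names(coef(fit_one))  # "(Intercept)" "height"

# Two slopes: one linear coefficient per continuous predictor
fit_two <- lm(weight ~ height + age, data = df)
names(coef(fit_two))  # "(Intercept)" "height" "age"
```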

4. Intercepts

In contrast to slopes, intercepts are used to predict the effect of discrete groups. These are factors or characters in R, such as the variable fish. For a single grouping variable, we have two options. First, we can have a reference intercept group and a contrast. In R, the formula for this would be "y predicted by x", written y ~ x. Second, we can estimate an intercept for each group. In R, this would be "y predicted by x minus 1", written y ~ x - 1.
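The two options can be compared side by side. This is only a sketch: the fish names and the size_cm measurements are invented for illustration.

```r
# Hypothetical data: fish is a discrete group, size_cm is the response
df <- data.frame(
  fish    = c("bass", "bass", "trout", "trout"),
  size_cm = c(30, 34, 40, 44)
)

# Option 1: reference intercept plus a contrast
# (Intercept) is the bass mean; fishtrout is the trout-minus-bass difference
fit_ref <- lm(size_cm ~ fish, data = df)
coef(fit_ref)  # (Intercept) = 32, fishtrout = 10

# Option 2: one intercept per group, no global intercept
fit_all <- lm(size_cm ~ fish - 1, data = df)
coef(fit_all)  # fishbass = 32, fishtrout = 42
```

Both fits describe the same group means; only the way the coefficients are expressed differs.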

5. Multiple intercepts

In contrast to multiple slopes, multiple intercepts are more complicated. When using multiple intercepts, the effects of each group are estimated compared to a reference group. In R, this is the first level in the factor, which by default is the first one alphabetically. The default formula in R estimates one reference group per variable. For example, y is predicted by x1 plus x2. We can also pick one variable and estimate an intercept for every one of its groups. This can be done using the "minus 1" notation. In this case, order is important because the full set of intercepts is estimated only for the first variable in the formula; later variables still use a reference group.
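To see why order matters, it can help to look at the column names model.matrix() produces. The factors x1 and x2 and their levels below are placeholders chosen for this sketch.

```r
# Hypothetical factors with two levels each
df <- data.frame(
  x1 = factor(c("a", "b", "a", "b")),
  x2 = factor(c("u", "u", "v", "v"))
)

# Default: a global intercept, plus one contrast per variable
cols_default <- colnames(model.matrix(~ x1 + x2, data = df))
cols_default  # "(Intercept)" "x1b" "x2v"

# "minus 1": every level of the FIRST variable gets its own column
cols_x1_first <- colnames(model.matrix(~ x1 + x2 - 1, data = df))
cols_x1_first  # "x1a" "x1b" "x2v"

# Swapping the order moves the full set of intercepts to x2
cols_x2_first <- colnames(model.matrix(~ x2 + x1 - 1, data = df))
cols_x2_first  # "x2u" "x2v" "x1b"
```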

6. Dummy variables

R uses dummy variables to code for group membership. Usually, these are created out of sight by the model.matrix() function, which lm() and glm() call internally. It creates a matrix that uses a 0 or 1 to code for group membership. For example, if a model's input was two colors, red and blue, the regression would require dummy variables. One option is to specify an intercept and the membership in blue, so that red becomes the reference group. Or, we could specify membership in red and membership in blue as separate columns.
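The two codings for red and blue can be written out by hand to make the idea concrete. This is a hand-built sketch, not what R does internally; the vector color and the matrix names X_ref and X_all are assumptions for this example.

```r
# Hypothetical input: four observations with two colors
color <- c("red", "blue", "blue", "red")

# Coding 1: an intercept column plus a 0/1 column for membership in blue
# (red is the reference group here)
X_ref <- cbind(intercept = 1, blue = as.numeric(color == "blue"))
X_ref

# Coding 2: a 0/1 membership column for each color, no intercept
X_all <- cbind(
  red  = as.numeric(color == "red"),
  blue = as.numeric(color == "blue")
)
X_all
```

In the second coding every row sums to 1, because each observation belongs to exactly one group.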

7. model.matrix()

Luckily for us, we do not need to specify our own dummy variables. The model.matrix() function in R takes care of this for us. For example, you can see the default output for colors on the screen. Note that the order of the columns is determined by the order of the factor's levels. You can change the order with functions in the tidyverse or by using the factor() function.
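Since the on-screen output is not part of this transcript, here is a small sketch of what model.matrix() returns for a made-up color vector, and how factor() can change the reference level.

```r
# Hypothetical input: four observations with two colors
color <- c("red", "blue", "blue", "red")

# Default: levels are alphabetical, so blue is the reference group
mm_default <- model.matrix(~ color)
colnames(mm_default)  # "(Intercept)" "colorred"

# Reorder the levels with factor() so that red becomes the reference
color_f <- factor(color, levels = c("red", "blue"))
mm_releveled <- model.matrix(~ color_f)
colnames(mm_releveled)  # "(Intercept)" "color_fblue"

# "minus 1" gives one membership column per color
mm_all <- model.matrix(~ color - 1)
colnames(mm_all)  # "colorblue" "colorred"
```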

8. Factor vs numeric caveat

One last warning about model.matrix(). The function treats numeric inputs as continuous numbers, even when they actually represent groups. For example, perhaps you have a vector of months and you want a contrast for each month. You might try entering "predicted by month". However, R assumes you want to treat month as a slope. The solution is to specify month as a factor or character in R. This now produces the desired results.
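The caveat is easy to reproduce. In this sketch the month vector is invented and holds only three months to keep the output short.

```r
# Hypothetical input: months stored as numbers
month <- c(1, 2, 3, 1, 2, 3)

# Numeric input: month is treated as a single slope
cols_numeric <- colnames(model.matrix(~ month))
cols_numeric  # "(Intercept)" "month"

# Wrapped in factor(): one contrast per month, relative to month 1
cols_factor <- colnames(model.matrix(~ factor(month)))
cols_factor  # "(Intercept)" "factor(month)2" "factor(month)3"
```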

9. Let's practice!

Now, it's your turn to explore model.matrix() in R.
