Categorical inputs

1. Categorical inputs

In this lesson you will see how categorical variables are represented in R models.

2. Example: Effect of Diet on Weight Loss

As an example, suppose we want to model weight loss over 24 months as a function of diet, as well as age and body mass index at the start of the study. Diet is a categorical variable with three possible values: Mediterranean, Low-Carb, and Low-Fat.

3. model.matrix()

In many R modeling functions, a categorical variable with N possible levels is represented under the covers as N-1 indicator (0/1) variables. You can see the representation R uses via the model.matrix() call, which takes a formula and a data frame as inputs.
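As a minimal sketch of that call, the snippet below builds a small data frame and passes it to model.matrix(). The data frame name diet_df, the column names (WtLoss24, Diet, Age, BMI), and every value in it are invented for illustration and are not the data from the lesson.

```r
# Hypothetical toy data: names and values are assumptions for illustration only.
diet_df <- data.frame(
  WtLoss24 = c(4.1, 6.3, 2.8, 5.0, 3.7, 6.9),
  Diet = factor(
    c("Low-Carb", "Low-Fat", "Mediterranean",
      "Low-Carb", "Low-Fat", "Mediterranean"),
    levels = c("Low-Carb", "Low-Fat", "Mediterranean")
  ),
  Age = c(34, 51, 46, 29, 62, 40),
  BMI = c(31.2, 28.4, 33.0, 27.9, 30.1, 29.5)
)

# model.matrix() takes a formula and a data frame and returns
# the numeric design matrix that a modeling function would work with.
mm <- model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet_df)

# The three-level Diet factor expands to two 0/1 columns;
# the first level, Low-Carb, gets no column of its own.
print(colnames(mm))
print(mm)
```

Rows where both Diet columns are 0 correspond to the Low-Carb diet.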

4. Indicator Variables to Represent Categories

In our example, the Diet variable becomes two indicators, one for Low-Fat and another for Mediterranean. The left-out level, Low-Carb, is the default, or reference level. Converting a categorical variable into a set of indicator variables is sometimes called "one-hot-encoding".
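The reference level is not fixed. With R's default settings for unordered factors, it is the first level of the factor, and relevel() can move a different level to the front. The sketch below uses a made-up vector of diets (the variable names are hypothetical) to show which level is left out before and after releveling.

```r
# Hypothetical vector of diet labels, for illustration only.
diet <- factor(c("Low-Carb", "Low-Fat", "Mediterranean", "Low-Fat"))

# By default the levels are sorted, and the first one acts as the reference.
ref_default <- levels(diet)[1]
print(ref_default)

# Indicator columns with the default reference level (Low-Carb is left out).
cols_default <- colnames(model.matrix(~ diet))
print(cols_default)

# relevel() makes Mediterranean the first level, and so the new reference.
diet_med <- relevel(diet, ref = "Mediterranean")
cols_med <- colnames(model.matrix(~ diet_med))
print(cols_med)
```

Changing the reference level changes which comparisons the coefficients express, not the predictions of the fitted model.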

5. Interpreting the Indicator Variables

Under the covers, the lm() function solves a numerical problem, with a coefficient for every indicator variable, as shown in the formula here. Recall that for a linear model the Intercept is the predicted value when all the variables equal 0. In our example, this means that the Intercept is the predicted value when Age and BMI are zero and Diet is Low-Carb. The coefficients for the Mediterranean and Low-Fat diets represent the change in weight loss relative to the Low-Carb diet for a person of the same age and BMI. For example, a person on the Mediterranean diet on average lost a kilogram less than a person of the same age and BMI on the Low-Carb diet.

6. Issues with one-hot-encoding

Because a categorical variable with N possible values encodes to N-1 variables, using variables like ZIP code, which have very many possible values, can be dangerous, both for computational reasons and because too many variables relative to the number of data points can lead to overfitting. To deal with this problem some algorithms encode the levels as numbers. In our example, we might code the diets as the values 1, 2, and 3. This is not recommended with algorithms like linear regression that treat data points as points in a geometric space, because it can lead to misleading results. Luckily, in R you don't usually have to worry about encoding categoricals, as most R functions do that for you. However, it's still important to be aware of the conversion and how to interpret it.

7. Let's practice!

Now let's practice working with categorical variables.
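As a warm-up, here is a rough sketch that ties the interpretation of the coefficients to the warning about numeric codes. The data are simulated without noise from made-up effects (only the one-kilogram Mediterranean difference echoes the lesson), and all names are hypothetical. Because the effects are known, you can compare what lm() recovers with indicator variables against what it reports when the diets are coded 1, 2, and 3.

```r
# Simulated, noise-free data. All names and numbers are assumptions.
diets <- c("Low-Carb", "Low-Fat", "Mediterranean")
sim <- expand.grid(
  Diet = diets,
  Age = c(30, 45, 60),
  BMI = c(24, 29, 34),
  stringsAsFactors = FALSE
)
sim$Diet <- factor(sim$Diet, levels = diets)

# Made-up "true" diet effects in kg, relative to Low-Carb.
effect <- c("Low-Carb" = 0, "Low-Fat" = -2, "Mediterranean" = -1)
sim$WtLoss24 <- 5 + unname(effect[as.character(sim$Diet)]) +
  0.02 * sim$Age + 0.1 * sim$BMI

# Indicator coding: one coefficient per non-reference diet.
fit <- lm(WtLoss24 ~ Diet + Age + BMI, data = sim)
print(round(coef(fit), 3))

# Numeric coding 1, 2, 3: the model is forced to fit a single slope,
# as if each step between diets changed weight loss by the same amount.
sim$DietNum <- as.integer(sim$Diet)
fit_num <- lm(WtLoss24 ~ DietNum + Age + BMI, data = sim)
print(round(coef(fit_num), 3))
```

With indicator coding, the diet coefficients match the simulated differences from Low-Carb. With numeric coding, the single DietNum slope cannot represent a Low-Fat effect that is larger than the Mediterranean one, which is the kind of misleading result the lesson warns about.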
