
Transforming variables

1. Transforming variables

Sometimes, the relationship between the explanatory variable and the response variable may not be a straight line. To fit a linear regression model, you need to transform the explanatory variable or the response variable, or both of them.

2. Perch dataset

Consider the perch in the fish dataset.

3. It's not a linear relationship

The upward curve in the mass versus length data prevents us drawing a straight line that follows it closely.

4. Bream vs. perch

To understand why the bream have a strong linear relationship between mass and length but the perch don't, you need to understand your data. I'm not a fish expert, but looking at the picture of the bream on the left, it has a very narrow body. I guess that as bream get bigger, they mostly just get longer and not wider. By contrast, the perch on the right has a round body, so I guess that as it grows, it gets fatter and taller as well as longer. Since perch grow in three directions at once, maybe the length cubed will give a better fit.

5. Plotting mass vs. length cubed

Here's an update to the previous plot. The only change is that the x-axis is now length to the power three. The data points fit the line much better, so we're ready to run a model.
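As a sketch of that plot (the perch data here is simulated, and the column names `length_cm` and `mass_g` are assumptions about the fish dataset), you can put length cubed directly on the x-axis:

```r
library(ggplot2)

# Simulated stand-in for the perch data (real column names may differ).
set.seed(1)
perch <- data.frame(length_cm = runif(50, 10, 45))
perch$mass_g <- 0.02 * perch$length_cm ^ 3 + rnorm(50, sd = 20)

# Plot mass against length cubed; the trend is now close to a straight line.
p <- ggplot(perch, aes(x = length_cm ^ 3, y = mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p
```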

6. Modeling mass vs. length cubed

To model a variable raised to the power of something, there is a slight change to the way the formula is written. The caret symbol has a special meaning inside model formulas. To tell lm that you want exponentiation, you need to wrap that term inside the I function. The I function is sometimes pronounced "as is". Otherwise, everything is the same, with the response variable on the left and the explanatory variable on the right.
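A minimal sketch of that formula (again with simulated perch data and assumed column names):

```r
# Simulated stand-in for the perch data (real column names may differ).
set.seed(1)
perch <- data.frame(length_cm = runif(50, 10, 45))
perch$mass_g <- 0.02 * perch$length_cm ^ 3 + rnorm(50, sd = 20)

# Inside a model formula, ^ has a special meaning, so wrap the cubed
# term in I() ("as is") to get literal exponentiation.
mdl_perch <- lm(mass_g ~ I(length_cm ^ 3), data = perch)
coef(mdl_perch)
```

Without `I()`, the formula `mass_g ~ length_cm ^ 3` would be interpreted in formula notation, not as exponentiation.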

7. Predicting mass vs. length cubed

You create the explanatory data frame in the same way as usual. Notice that you specify the lengths, not the lengths cubed. R takes care of the transformation automatically. The code for adding predictions is the same mutate and predict combination you've seen before.
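The workflow might look like this (simulated perch data, assumed column names). Note that the explanatory data frame holds plain lengths; the cube in the formula is applied automatically inside `predict()`:

```r
library(dplyr)

# Simulated stand-in for the perch data (real column names may differ).
set.seed(1)
perch <- data.frame(length_cm = runif(50, 10, 45))
perch$mass_g <- 0.02 * perch$length_cm ^ 3 + rnorm(50, sd = 20)
mdl_perch <- lm(mass_g ~ I(length_cm ^ 3), data = perch)

# Specify lengths, not lengths cubed; R handles the transformation.
explanatory_data <- data.frame(length_cm = seq(10, 40, 5))

# The familiar mutate-and-predict combination.
prediction_data <- explanatory_data %>%
  mutate(mass_g = predict(mdl_perch, explanatory_data))
prediction_data
```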

8. Plotting mass vs. length cubed

The predictions have been added to the plot of mass versus length cubed as blue points. As you might expect, they follow the line drawn by ggplot. It gets more interesting on the original x-axis. Notice how the blue points curve upwards to follow the data. Your linear model has non-linear predictions, after the transformation is undone.

9. Facebook advertising dataset

Let's try one more example using a Facebook advertising dataset. The flow of online advertising is that you pay money to Facebook, who show your advert to Facebook users. If a person sees the advert, it's called an impression. Then some people who see the advert will click on it.

10. Plot is cramped

Let's look at impressions versus spend. If we draw the standard plot, the majority of the points are crammed into the bottom-left of the plot, making it difficult to assess whether there is a good fit or not.

11. Square root vs square root

By transforming both variables with square roots, the data are spread more evenly across the plot, and the points follow the line fairly closely. Square roots are a common transformation when your data has a right-skewed distribution.

12. Modeling and predicting

Running the model and creating the explanatory dataset are the same as usual. You don't need to wrap the square-root term in I(); that's only needed for exponentiation. Prediction requires an extra step. Because we took the square root of the response variable (not just the explanatory variable), the predict function returns the square root of the number of impressions. That means we have to undo the square root by squaring the predicted responses. Undoing the transformation of the response is called backtransformation.
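The whole sequence can be sketched like this. The advertising data is simulated here, and the column names `spent_usd` and `n_impressions` are assumptions about the real dataset:

```r
# Simulated stand-in for the advertising data (column names assumed).
set.seed(2)
ads <- data.frame(spent_usd = runif(100, 0, 600))
ads$n_impressions <- (50 * sqrt(ads$spent_usd) + rnorm(100, sd = 30)) ^ 2

# sqrt() works as-is inside a formula; no I() wrapper needed.
mdl_ads <- lm(sqrt(n_impressions) ~ sqrt(spent_usd), data = ads)

explanatory_data <- data.frame(spent_usd = seq(0, 600, 100))

# predict() returns sqrt(n_impressions), so square it to backtransform.
prediction_data <- explanatory_data
prediction_data$sqrt_n_impressions <- predict(mdl_ads, explanatory_data)
prediction_data$n_impressions <- prediction_data$sqrt_n_impressions ^ 2
prediction_data
```

The final `n_impressions` column is on the original scale, ready to compare against the raw data.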

13. Let's practice!

Time to transform some variables.