
Transforming the response before modeling

1. Transforming the response before modeling

Sometimes you get better models by transforming the output, rather than predicting the output directly. In this lesson you will see an example of modeling a transformed output, and transforming the predictions back into the original space.

2. The Log Transform for Monetary Data

For prediction, a useful transformation is log-transforming monetary data, like income or profits. Monetary values are often lognormally distributed, which means they tend to be skewed, with a long tail. Lognormal distributions also typically have a wide dynamic range: in this data, incomes range from 60 dollars to over 700,000. Such a wide dynamic range can make prediction more difficult.
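
To make that concrete, here is a minimal sketch using a synthetic lognormal sample (NumPy only; the parameters are made up and this is not the course's income data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "income-like" values: lognormal draws are skewed with a long right tail.
income = rng.lognormal(mean=10, sigma=1.0, size=100_000)

# The sample spans several orders of magnitude, and the upper tail
# sits far above the bulk of the data.
print(f"min: {income.min():,.0f}    max: {income.max():,.0f}")
print(f"median: {np.median(income):,.0f}    99th percentile: {np.percentile(income, 99):,.0f}")
```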

3. Lognormal Distributions

With lognormal data, the mean value (shown here in green) is generally greater than the median value (shown here in orange). You can think of the median value of an income distribution as representing a typical income: half the subjects have higher income, and half have lower. Regression algorithms usually predict the expected, or mean, value of the output. This means that predicting income directly will tend to overpredict the typical income for subjects with a given set of characteristics.
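
A quick sketch of that mean-versus-median gap on a synthetic lognormal sample (again made-up parameters, not the course's data):

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1.0, size=100_000)

# For lognormal data the mean sits above the median, so a regression that
# predicts the mean will overstate the "typical" (median) value.
print(f"mean:   {income.mean():,.0f}")
print(f"median: {np.median(income):,.0f}")
```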

4. Back to the Normal Distribution

If you take the log of lognormally distributed data, the resulting data is normally distributed. This means the mean tracks the median, and the dynamic range of the data is more modest. Here, we look at the previous income data on a log scale. The distribution looks closer to a normal distribution, and the mean and median are close together.
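
Continuing the same synthetic sketch, taking the log pulls the mean and median back together:

```python
import numpy as np

rng = np.random.default_rng(42)
log_income = np.log(rng.lognormal(mean=10, sigma=1.0, size=100_000))

# On the log scale the data is approximately normal,
# so the mean and the median nearly coincide.
print(f"mean of log(income):   {log_income.mean():.2f}")
print(f"median of log(income): {np.median(log_income):.2f}")
```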

5. The Procedure

The modeling procedure is to first fit a model to the log of the outcome.

6. The Procedure

Next, apply the model to the data you want predictions for.

7. The Procedure

Finally, exponentiate to transform the predictions back into outcome space.
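
Put together as code, the three steps look roughly like this. This is a minimal sketch with scikit-learn and synthetic data; the variable names and the choice of a linear model are illustrative assumptions, not necessarily the course's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: a positive, skewed outcome whose log is linear in x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.exp(1 + 0.3 * X[:, 0] + rng.normal(0, 0.5, size=500))

X_train, X_new = X[:400], X[400:]
y_train = y[:400]

# Step 1: fit a model to the log of the outcome.
model = LinearRegression().fit(X_train, np.log(y_train))

# Step 2: apply the model to the data you want predictions for.
log_pred = model.predict(X_new)

# Step 3: exponentiate to transform the predictions back into outcome space.
pred = np.exp(log_pred)
print(pred[:5])
```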

8. Predicting Log-transformed Outcomes: Multiplicative Error

One consequence of log-transforming outcomes is that prediction errors are multiplicative in outcome space. This means the size of the error is relative to the size of the outcome. We can define relative error as the prediction error divided by the true outcome. Log transformations are useful when you want to reduce relative error, rather than additive error. Predicting monetary amounts is one such example. Being off in your prediction by $100 is quite different when the true amount is $250 rather than $10,000.
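
To restate that dollar example as a tiny sketch (the numbers below are just the ones from the text):

```python
# Relative error: the prediction error divided by the true outcome.
def relative_error(pred, actual):
    return (pred - actual) / actual

# Being off by $100 is a 40% error on a $250 outcome ...
print(relative_error(pred=350, actual=250))        # 0.4
# ... but only a 1% error on a $10,000 outcome.
print(relative_error(pred=10_100, actual=10_000))  # 0.01
```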

9. Root Mean Squared Relative Error

Let's define a measure called root mean squared relative error, by analogy to RMSE. A model that predicts log-outcome will often have lower RMS-relative error but larger RMSE than a model that predicts outcome directly. Let's see this by example.
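
As a sketch, both measures written out as small NumPy functions (the function names are mine, not necessarily the course's):

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error: typical size of the prediction error in outcome units."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def rms_relative_error(pred, actual):
    """Root mean squared relative error: typical size of the error
    relative to the size of the true outcome."""
    return np.sqrt(np.mean(((pred - actual) / actual) ** 2))

# Tiny usage example with the dollar figures from the previous slide.
actual = np.array([250.0, 10_000.0])
pred = np.array([350.0, 10_100.0])
print(rmse(pred, actual))                # both errors are $100, so RMSE is 100
print(rms_relative_error(pred, actual))  # dominated by the 40% miss on the small outcome
```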

10. Example: Model Income Directly

Here we want to predict a person's income based on their education and their score on a proficiency test taken about 25 years before the survey. We want to compare a model that predicts income directly to a model that predicts log income. First, we'll model income directly.
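
Here is a sketch of that first model on synthetic stand-in data. The column names (`education`, `test_score`, `income`) and the use of scikit-learn's LinearRegression are assumptions for illustration; the course's actual data and modeling code may differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the survey data: income loosely driven by
# education and a proficiency-test score, with lognormal-style noise.
rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "education": rng.integers(8, 21, size=n),
    "test_score": rng.normal(50, 10, size=n),
})
df["income"] = np.exp(8.0 + 0.10 * df["education"] + 0.02 * df["test_score"]
                      + rng.normal(0, 0.6, size=n))

train, test = df.iloc[:1500], df.iloc[1500:]
features = ["education", "test_score"]

# Model income directly from the inputs.
model_income = LinearRegression().fit(train[features], train["income"])
```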

11. Model Performance

We can evaluate the model's error and relative error on new data.
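
Continuing the sketch above (this snippet assumes the synthetic `train`/`test` split, the `model_income` fit, and the `rmse`/`rms_relative_error` functions from the earlier sketches):

```python
# Evaluate the direct income model on the held-out data.
pred_income = model_income.predict(test[features])
print("RMSE:              ", rmse(pred_income, test["income"]))
print("RMS-relative error:", rms_relative_error(pred_income, test["income"]))
```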

12. Model log(Income)

Now let's fit a model that predicts log income.

13. Model Performance

Now transform the predictions back into outcome units, and evaluate this model's error and relative error on new data.
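
Sketching these two slides with the same synthetic data (again assuming the objects and imports defined in the earlier sketches):

```python
# Fit a model that predicts log(income) ...
model_log_income = LinearRegression().fit(train[features], np.log(train["income"]))

# ... transform its predictions back into outcome units ...
pred_from_log = np.exp(model_log_income.predict(test[features]))

# ... and evaluate error and relative error on the held-out data.
print("RMSE:              ", rmse(pred_from_log, test["income"]))
print("RMS-relative error:", rms_relative_error(pred_from_log, test["income"]))
```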

14. Compare Errors

We can see that modeling income directly gave a smaller RMSE, but modeling log income gave a smaller relative error.

15. Let's practice!

Now let's practice predicting log outcome and calculating error and relative error.
