1. Transforming inputs before modeling
Last lesson, you explored transforming the output before modeling. In this lesson, you will explore transforming input variables before modeling.
2. Why To Transform Input Variables
There are many reasons why you might want to transform input variables before modeling. The most important reason is domain knowledge: you may know that a transformed variable is more informative than the variables you have. For instance, animal intelligence is related to the ratio of brain mass to body mass raised to the two-thirds power, rather than to brain or body mass directly.
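As a minimal R sketch of building such a derived variable (the animals data frame and its brain_mass and body_mass columns are hypothetical, purely for illustration):

# Hypothetical data: derive the brain-to-body-mass ratio feature.
animals$intelligence_ratio <- animals$brain_mass / animals$body_mass^(2/3)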
3. Why To Transform Input Variables
You may also want to transform variables for pragmatic reasons, to make them easier to model. The log transform we saw in the previous lesson is again useful here, especially for monetary input variables.
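For example, a sketch in R (the purchases data frame and its income and spend columns are assumed for illustration):

# Log-transform a monetary input before using it in a linear model.
purchases$log_income <- log(purchases$income)
model <- lm(spend ~ log_income, data = purchases)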
4. Why To Transform Input Variables
Or you may want to transform variables to meet modeling assumptions, like linearity.
5. Example: Predicting Anxiety
As an example, here we see a plot that relates a person's level of anxiety to the number of hassles they experience in a day. The relationship is clearly non-linear: when a person has experienced only a few hassles, another hassle may not affect their anxiety much; but when they are having a day filled with problems, the next hassle may have a much greater impact on their emotions.
So if you wanted to fit a linear regression model for anxiety that included hassles as a variable, you would have a problem.
6. Transforming the hassles variable
We can try fitting a model not only to hassles, but to hassles squared or hassles cubed. As the graph shows, both hassles squared and hassles cubed fit the data better than a straight line.
But which model is best?
7. Different possible fits
There are many different transformations of the hassles variable that might give us the shape we observe in the data. If we have domain knowledge for preferring one, then that is how we should pick. But if we don't know, and are mostly concerned with accurate prediction, then we should pick the one that gives us the lowest prediction error.
Note that when you raise a variable to a power in a formula, you should wrap the expression in the I() function so that it is evaluated arithmetically, not interpreted as a formula operator like an interaction.
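For instance, a sketch of the three fits (the hassleframe data frame, with columns hassles and anxiety, is an assumed name here):

# I() makes ^ act as arithmetic exponentiation inside the formula,
# rather than as the formula interaction operator.
model_lin   <- lm(anxiety ~ hassles,      data = hassleframe)
model_quad  <- lm(anxiety ~ I(hassles^2), data = hassleframe)
model_cubic <- lm(anxiety ~ I(hassles^3), data = hassleframe)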
8. Compare different models
We can fit models to hassles, hassles squared, and hassles cubed, and compare them. In this case, the cubic model has the best R-squared.
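One way to compare the in-sample fits, continuing the sketch above:

# R-squared of each fit on the training data.
summary(model_lin)$r.squared
summary(model_quad)$r.squared
summary(model_cubic)$r.squared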
9. Compare different models
We should also validate the models' out-of-sample performance, in this case by cross-validation. The cubic model still looks the best.
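A minimal cross-validation sketch in base R (no course-specific helpers assumed; hassleframe and its columns are the same assumed data as above):

# k-fold cross-validation: estimate out-of-sample RMSE for one formula.
cv_rmse <- function(fmla, data, k = 3) {
  folds <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  preds <- numeric(nrow(data))
  for (i in 1:k) {
    fit <- lm(fmla, data = data[folds != i, ])        # train on the other folds
    preds[folds == i] <- predict(fit, newdata = data[folds == i, ])
  }
  sqrt(mean((data$anxiety - preds)^2))                # outcome assumed to be anxiety
}
cv_rmse(anxiety ~ I(hassles^3), hassleframe)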
Note that we don't know that hassles literally affect anxiety in a cubic way; we only know that of the three models we tried, the cubic model seems to predict the best.
In a later lesson, we will see regression algorithms that can learn some of these input variable transforms automatically, so that the data scientist doesn't always have to specify them.
10. Let's practice!
Now let’s practice transforming input variables and modeling with them.