1. Transformations
In the previous videos, you learned how to prepare the data so that it can be used for predictive modeling. You will now take it one step further, and learn how to improve the predictive model by adding transformations of the predictive variables to the basetable.
2. Motivation for transformations
Consider a basetable with a predictive variable that is the sum of all donations of a donor in the past. You can imagine that for some donors, this amount can be really huge, while for other donors, it is rather low.
Consider two donors: Alice has donated 100 Euros, and Bob has donated 1,100 Euros. Now consider Carol, who has donated 10,000 Euros, and Dave, who has donated 11,000 Euros.
It is clear that the difference between Alice and Bob is more important than the difference between Carol and Dave. Indeed, even though the absolute difference between the amounts donated is the same, the 1,000 Euros difference is relatively more important in the first case.
3. Log transformation
For this variable, it would be interesting to stress the differences between small amounts and to fade out the differences between larger amounts. One way to do this is to take the log transformation of the original variable. As you can see, the log transformation makes differences between small amounts larger and differences between larger amounts smaller.
4. Log transformation
In Python, you can easily calculate the log transformation of a variable using the `log` function in the `numpy` package.
Taking the log transformation of a variable is especially interesting for variables that can take both very small and very large values, as is the case for amounts or recency variables. Besides log transformations, you can also take the inverse or square of the variable.
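As a minimal sketch of these transformations, assuming a pandas basetable with a hypothetical `total_donations` column (the column name is illustrative, not from the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical basetable with the donation amounts from the example.
basetable = pd.DataFrame({"total_donations": [100, 1100, 10000, 11000]})

# Log transformation: spreads out small amounts, compresses large ones.
basetable["log_total_donations"] = np.log(basetable["total_donations"])

# Other transformations mentioned above: inverse and square.
basetable["inv_total_donations"] = 1 / basetable["total_donations"]
basetable["sq_total_donations"] = basetable["total_donations"] ** 2

# After the log transform, the Alice-Bob gap (100 vs 1,100) is
# larger than the Carol-Dave gap (10,000 vs 11,000).
gap_small = basetable["log_total_donations"][1] - basetable["log_total_donations"][0]
gap_large = basetable["log_total_donations"][3] - basetable["log_total_donations"][2]
```

Here `gap_small` is log(1100) - log(100) = log(11), while `gap_large` is log(11000) - log(10000) = log(1.1), so the relatively important difference between the small amounts now dominates.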
5. Interactions
Another type of transformation is calculating interactions between variables. Consider the predictive variable `number_donations`, which contains the number of donations last year, and the predictive variable `recency`, the time since the last donation. Both variables have predictive power on their own, but they can also reinforce each other. For instance, donors who made many donations last year but have not donated in a while are very likely to donate again soon. On the other hand, donors who donated very recently but did not donate often last year are very unlikely to donate again soon.
6. Interactions in Python
One way to incorporate this type of interaction in your predictive model is by adding interaction terms, that is, by adding the product of two variables as a new variable to the basetable. In some cases, this can significantly improve the performance of your model.
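A short sketch of adding such an interaction term, assuming a pandas basetable with the `number_donations` and `recency` columns described above (the example values are made up):

```python
import pandas as pd

# Hypothetical basetable with the two predictive variables from the text.
basetable = pd.DataFrame({
    "number_donations": [5, 1, 4, 0],
    "recency": [300, 20, 10, 400],
})

# Interaction term: the product of the two variables, added as a new column.
basetable["number_donations_x_recency"] = (
    basetable["number_donations"] * basetable["recency"]
)
```

The new column can then be used as an additional predictive variable alongside the originals.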
Adding interactions comes with some warnings, though. It is not a good idea to add all possible interactions to the model, as the number of variables will explode and the model will become less interpretable. A good strategy is to include interactions with variables that have high predictive power on their own, or to discuss with domain experts which interactions might be interesting. This does not mean that only interactions with highly predictive variables are useful: sometimes variables with very low predictive power have an interesting interaction with other variables, but the chances of that are lower than for highly predictive variables.
7. Let's practice!
Now let's try some examples.