
Adding aggregated variables

1. Adding aggregated variables

The variables that you learned to construct in the previous video are fixed in time. In this video, you will see how to add variables to the basetable that are aggregated over time.

2. Motivation for aggregated variables (1)

Consider the problem where you want to predict whether a donor will donate more than 500 Euro in the next year. What could be a good predictive variable for this problem?

3. Motivation for aggregated variables (2)

Intuitively, if a donor gave a very low amount the year before, chances are low that they will suddenly donate more than 500 Euro in the next year. Conversely, if a donor gave more than 1000 Euro the year before, chances are high that they will donate more than 500 Euro again next year. This holds not only for the donation problem but for predictive models in general: very often, the best predictor of a future event is whether the event occurred before.

4. Adding total value last year (1)

So let's try to put this into practice and add the total value someone donated last year to the basetable. We are given a pandas dataframe with the gifts of donors. We need to keep the timeline in mind. In this problem, the reference date is January 1st 2017, so the start date of the period we want to sum the gifts over is January 1st 2016, and the end date is January 1st 2017. First we select the gifts made in this period.
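The selection step could be sketched as follows. The toy data and the column names `id`, `date` and `amount` are assumptions for illustration, not taken from the video, and treating the end date as exclusive is a choice made here so that gifts on the reference date itself are left out.

```python
from datetime import datetime

import pandas as pd

# Hypothetical gifts data; the column names are assumptions.
gifts = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": pd.to_datetime(
        ["2015-12-31", "2016-03-10", "2016-11-05", "2017-01-01"]
    ),
    "amount": [50.0, 20.0, 600.0, 75.0],
})

# Period to aggregate over: the year before the reference date.
start_date = datetime(2016, 1, 1)
end_date = datetime(2017, 1, 1)

# Keep gifts made on or after the start date and strictly before the end date.
gifts_2016 = gifts[(gifts["date"] >= start_date) & (gifts["date"] < end_date)]
```

With this toy data, only the two gifts dated in 2016 remain in `gifts_2016`.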

5. Adding total value last year (2)

Next, we need to sum these gifts per donor. In Python, you can do that using the groupby method with the donor ID as argument. From this grouped structure, you select the amount and sum it. We then rename the columns of the resulting dataframe appropriately. Once we have the sum of donations for each donor, we can add it to the basetable. To do so, we use the merge method, which performs a left join when the argument how is set to left.
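A minimal sketch of the grouping and merging steps, again on made-up data. The names `gifts_2016_bydonor`, `donor_id` and `sum_2016` are hypothetical and may differ from those used in the video.

```python
import pandas as pd

# Hypothetical inputs: gifts already restricted to 2016, and a basetable
# with one row per donor. Column names are assumptions.
gifts_2016 = pd.DataFrame({"id": [1, 1, 2], "amount": [20.0, 30.0, 600.0]})
basetable = pd.DataFrame({"donor_id": [1, 2, 3]})

# Sum the gift amounts per donor and turn the result back into a dataframe.
gifts_2016_bydonor = gifts_2016.groupby("id")["amount"].sum().reset_index()
gifts_2016_bydonor.columns = ["donor_id", "sum_2016"]

# Left join: every donor in the basetable is kept, with or without gifts.
basetable = pd.merge(basetable, gifts_2016_bydonor, how="left", on="donor_id")
```

Because of the left join, donor 3, who made no gifts in 2016, stays in the basetable with a missing value for `sum_2016`. Whether to replace that missing value with 0 is a modelling decision the video does not address.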

6. Adding number of donations to the basetable

It is not necessary to exactly reconstruct the target as a predictive variable; other aggregated variables can be interesting as well. For instance, one can add the number of donations made by a donor in the year before the reference date. This might even be a better indicator for donations in the next year, as it expresses how regularly a donor donates. The procedure in Python is similar, except that instead of summing the donations, we count the number of donations made in 2016. Finally, the new variable is added to the basetable.
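The counting variant might look like the sketch below. As before, the data and the names `donor_id` and `count_2016` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical inputs: gifts already restricted to 2016, and a basetable
# with one row per donor. Column names are assumptions.
gifts_2016 = pd.DataFrame({"id": [1, 1, 2], "amount": [20.0, 30.0, 600.0]})
basetable = pd.DataFrame({"donor_id": [1, 2, 3]})

# Count the gifts per donor instead of summing their amounts.
gifts_2016_count = gifts_2016.groupby("id")["amount"].count().reset_index()
gifts_2016_count.columns = ["donor_id", "count_2016"]

# Add the new variable to the basetable with a left join.
basetable = pd.merge(basetable, gifts_2016_count, how="left", on="donor_id")
```

Note that `count` only counts non-missing amounts, and that donors without any gift in 2016 again end up with a missing value rather than 0.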

7. Let's practice!

Time to put this into practice. Let's add some other aggregated variables to the basetable.