Adding fixed variables

1. Adding predictive variables

In the previous chapter, you learned how to construct an early stage basetable with the population and the target. In this chapter, you will learn how to correctly add predictive variables to the basetable, so that they can be used to predict the target.

2. Predictive variables

Predictive variables can be any information that helps you predict the target. One important group of predictive variables, used in most predictive models, is demographics, like age, gender or place of residence. Other useful information mostly depends on the project definition, and could be anything ranging from spending behaviour, watching behaviour, product usage and clicks on a website to payment information. In general, it is smart to add as many variables as you can to the basetable and then use a variable selection algorithm to select the most relevant ones, as you learned in the first predictive analytics course.

3. Timeline compliant predictive variables (1)

As always, it is crucial that the predictive variables are compliant with the timeline. Assume today is March 5th 2019 and you want to construct a predictive model that predicts whether a donor will donate a certain amount in April 2019. To construct the basetable, we reconstruct the timeline in the past.

4. Timeline compliant predictive variables (2)

As the target is calculated using information from April 1st 2018 onwards, you need to make sure that the predictive variables are based on information known before April 1st 2018. Indeed, when you use this model to make predictions for April 2019, information from April 1st 2019 onwards is not available either, so using such information in the predictive variables would overestimate the true performance of your model.

5. Adding lifetime

Assume for instance that you want to add lifetime to the basetable, that is, the number of days since someone became a donor. Given is an early stage basetable in a pandas dataframe with the population, and for each donor the date on which he or she became a member. To calculate the lifetime, we should keep the timeline in mind: we do not calculate the lifetime as of today, but as of the reference date, which is the start of the target period, April 1st 2018. The lifetime can then easily be calculated as the difference between the reference date and the member_since date. Note that we should also make sure that the member_since date is before April 1st 2018, but this should already be the case as the population was constructed according to the timeline.
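The lifetime calculation described here could look roughly like the sketch below. The sample data and the `donor_id` column name are made up for illustration; only `member_since` and the reference date come from the explanation above.

```python
import pandas as pd

# Hypothetical early stage basetable: one row per donor in the population.
# The donor_id values and dates are invented for this example.
basetable = pd.DataFrame({
    "donor_id": [1, 2, 3],
    "member_since": pd.to_datetime(["2015-06-01", "2017-03-31", "2018-01-15"]),
})

# Reference date: the start of the target period, not today's date.
reference_date = pd.Timestamp("2018-04-01")

# Sanity check: every donor became a member before the reference date.
assert (basetable["member_since"] < reference_date).all()

# Lifetime in days, measured on the reference date.
basetable["lifetime"] = (reference_date - basetable["member_since"]).dt.days

print(basetable)
```

Using a fixed reference date instead of today's date keeps the variable compliant with the timeline, whenever the basetable is rebuilt.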

6. Adding preferred contact channel (1)

Let's consider another example where you want to add the preferred contact channel (phone or e-mail) to the basetable. This preferred contact channel is something that the donor can adjust over time, so it is important to keep the timeline in mind when adding this variable. Given is an early stage basetable with the population, and a pandas dataframe `contact_channel` that has for each donor the preferred contact channel and the start and end date between which this contact channel is valid. As you can see on the timeline, the reference date is again April 1st 2018, so you first need to keep only the rows in the `contact_channel` dataframe that are valid on the reference date: the start date lies before it and the end date lies after it.
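A minimal sketch of this filtering step follows. The column names `start_valid_date` and `end_valid_date`, the far-future end date for channels that are still valid, and the choice to include a channel that starts exactly on the reference date are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical contact_channel table: a donor can have several rows over time.
# Column names and values are assumptions for illustration.
contact_channel = pd.DataFrame({
    "donor_id": [1, 1, 2, 3],
    "contact_channel": ["Phone", "E-mail", "E-mail", "Phone"],
    "start_valid_date": pd.to_datetime(
        ["2016-01-01", "2018-01-01", "2015-05-01", "2018-05-01"]),
    # Assumption: channels that are still valid carry a far-future end date.
    "end_valid_date": pd.to_datetime(
        ["2017-12-31", "2099-12-31", "2099-12-31", "2099-12-31"]),
})

reference_date = pd.Timestamp("2018-04-01")

# Keep only the rows that are valid on the reference date.
valid_on_reference_date = (
    (contact_channel["start_valid_date"] <= reference_date)
    & (contact_channel["end_valid_date"] > reference_date)
)
contact_channel_reference_date = contact_channel[valid_on_reference_date]

print(contact_channel_reference_date)
```

In this sample, donor 1's old phone preference and donor 3's preference that only starts in May 2018 are both dropped, because neither was valid on April 1st 2018.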

7. Adding preferred contact channel (2)

Next, you can left join the basetable with the filtered contact_channel table to add the right contact channel to the basetable. Of course, you need to make sure that the donor IDs in the filtered contact_channel table are unique, because otherwise the join would duplicate donors in the basetable.
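The join could be sketched as follows, assuming the contact channel table has already been filtered to the rows valid on the reference date. The table contents and the `donor_id` key are hypothetical; `validate="many_to_one"` is an optional extra check that makes pandas raise an error if a donor ID is duplicated in the right-hand table.

```python
import pandas as pd

# Hypothetical basetable with the population (values invented for illustration).
basetable = pd.DataFrame({"donor_id": [1, 2, 3]})

# Hypothetical contact channel table, already filtered on the reference date.
contact_channel_reference_date = pd.DataFrame({
    "donor_id": [1, 2],
    "contact_channel": ["E-mail", "Phone"],
})

# Donor IDs must be unique, otherwise the join would duplicate donors.
assert contact_channel_reference_date["donor_id"].is_unique

# Left join: every donor in the population is kept, even without a match.
basetable = pd.merge(
    basetable,
    contact_channel_reference_date[["donor_id", "contact_channel"]],
    on="donor_id",
    how="left",
    validate="many_to_one",
)

print(basetable)
```

Donors without a valid contact channel on the reference date, such as donor 3 in this sample, end up with a missing value, which then needs to be handled before modelling.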

8. Let's practice!

Time to put this into practice.