Missing values

1. Missing values

Very often, not all data is available in the basetable: for certain persons or objects, values can be missing in the descriptive variables. This can be due to different reasons like issues in data collection, privacy concerns or others.

2. Replacing missing values by an aggregate (1)

There are many techniques to replace missing values. Often, it is sufficient to replace the missing value by an aggregate of the remaining values.

3. Replacing missing values by an aggregate (2)

For instance, if the age of a donor is missing, one can assume that replacing the missing values with the mean age of the remaining donors will not alter the model drastically.

4. Replacing missing values by an aggregate (3)

In some cases, using the mean is tricky because extreme values can drastically influence this value. For instance, if the maximum donation someone gave is missing, it is better to use the median to replace the missing value.

5. Replacing missing values by an aggregate (4)

Indeed, the mean might be influenced too much by donors that gave exceptionally high donations.
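As a rough illustration of this effect, the sketch below compares the mean and the median on a small made-up series. The name `max_donation` and the amounts are assumptions chosen for the example, not values from the course data.

```python
import pandas as pd

# Hypothetical maximum donations; one donor gave an exceptionally high amount
# and one value is missing (None becomes NaN in a float Series).
max_donation = pd.Series([10.0, 20.0, 30.0, 40.0, 5000.0, None])

# Both aggregates skip missing values by default.
mean_value = max_donation.mean()      # pulled upwards by the 5000 donation
median_value = max_donation.median()  # barely affected by the extreme value

# Replace the missing value by each aggregate to compare the results.
filled_with_mean = max_donation.fillna(mean_value)
filled_with_median = max_donation.fillna(median_value)

print(mean_value, median_value)
```

In this toy series the mean lies far above what most donors gave, while the median stays close to a typical donation.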

6. Replacing missing values by a fixed value (1)

In other cases, logical reasoning can be used to replace missing values. Consider for instance again the variable sum of donations in the last year. If the donor did not donate in the last year, the value is missing.

7. Replacing missing values by a fixed value (2)

However, we know that it should be zero, so missing values in that variable can be replaced by the fixed value zero. Finally, other more involved techniques to replace missing values exist. For instance, one can predict the missing values using a predictive model that has the other predictive variables as input. However, these techniques are beyond the scope of this course.

8. Replacing missing values in Python

Replacing missing values in a pandas dataframe column is straightforward using the `fillna` method. For instance, to replace the missing values in the column `donations_last_year` by 0, one can call this method on this column with 0 as the replacement value. To replace missing values in a column with the mean of the remaining values, one should first calculate this mean value, and then pass it to the `fillna` method as the replacement value.
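A minimal sketch of both replacements could look as follows. The small `basetable` built here, including the `age` column and all values, is made up for illustration; only the column name `donations_last_year` and the `fillna` method come from the text.

```python
import pandas as pd

# Hypothetical basetable with missing values in both columns.
basetable = pd.DataFrame({
    "donations_last_year": [50.0, None, 20.0, None],
    "age": [30.0, 40.0, None, 50.0],
})

# Replace missing values by the fixed value 0.
basetable["donations_last_year"] = basetable["donations_last_year"].fillna(0)

# First calculate the mean of the remaining values (missing values are skipped),
# then use it as the replacement value.
mean_age = basetable["age"].mean()
basetable["age"] = basetable["age"].fillna(mean_age)

print(basetable)
```

Assigning the result back to the column keeps the change explicit, since `fillna` returns a new object by default.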

9. Missing value dummies

In some cases, missing values can have a meaning. This information, the fact that a value is missing, can then be used as a predictive variable. For instance, if someone is not willing to share certain contact details, like an e-mail address, it could mean that this donor is not open to being contacted by the charity organisation, and is hence less likely to respond to a new campaign. Therefore, it is interesting to add a dummy variable as predictive variable that indicates whether a value is missing. In Python, you can check whether a value is missing using an equality comparison: a missing value (NaN) is not equal to itself, so comparing a value with itself is only false when the value is missing. With a list comprehension, you can add a 0 if email is not missing, and 1 otherwise.
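The dummy variable described here might be built as in the sketch below. The `basetable`, the `email` values, and the column name `no_email` are assumptions for illustration. The check relies on missing values being stored as NaN; a missing value stored as `None` would compare equal to itself, so the pandas `isnull` method is shown as an alternative that handles both.

```python
import numpy as np
import pandas as pd

# Hypothetical basetable; dtype=object keeps the missing e-mails as plain NaN.
basetable = pd.DataFrame({
    "email": pd.Series(
        ["a@example.com", np.nan, "b@example.com", np.nan], dtype=object
    )
})

# NaN is not equal to itself, so `email == email` is False only when missing:
# 0 if the e-mail is present, 1 if it is missing.
basetable["no_email"] = [0 if email == email else 1 for email in basetable["email"]]

# Alternative using pandas' own missing-value check.
no_email_alt = basetable["email"].isnull().astype(int)

print(basetable)
```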

10. Let's practice!

Time for you to practice and replace missing values in the basetable.