Filling down missing values

1. Missing, missing data

So far we've talked about how to explore, handle, and search for and replace missing values - these are values that are explicitly missing. In this lesson we are going to talk about how to efficiently handle implicit missing values. These are values that are implied to be missing, but are not explicitly listed in the data. They are, in fact, missing, missing values.

2. Another perspective on missing data

So we've covered how to find missing values masquerading as real values. Now, let's talk about those that aren't even in the data. That's right, missing, missing values. (dun, dun DUNNNNN!) Let's say we have Tetris scores for three friends, Robin, Sam, and Blair. These are recorded in the morning, afternoon, and evening. Do you notice something different about one of my friends? There are only two recorded measurements for Sam - he has morning and afternoon. Sam is missing an evening score! Well, it's not recorded as missing, but it is not there! This becomes clearer if we spread out the data, so that we have one column each for afternoon, evening, and morning. Notice how there is now a missing value for the evening for Sam? The missing value wasn't shown before, because the whole row was actually...missing! So how does that work?
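To see the gap appear, here is a minimal sketch of that reshaping step. The score values are made up for illustration (the lesson does not list them), and the sketch assumes tidyr's pivot_wider, the newer equivalent of spread. Note that pivot_wider orders the new columns by first appearance rather than alphabetically.

```r
library(tibble)
library(tidyr)

# Illustrative scores only: these numbers are assumptions, not the lesson's data.
# Sam has no "evening" row at all, so nothing in this table is NA.
tetris <- tribble(
  ~name,   ~time,       ~value,
  "Robin", "morning",   358,
  "Robin", "afternoon", 371,
  "Robin", "evening",   432,
  "Sam",   "morning",   139,
  "Sam",   "afternoon", 184,
  "Blair", "morning",   963,
  "Blair", "afternoon", 982,
  "Blair", "evening",   834
)

# One column per time of day; the absent row becomes an NA cell for Sam
tetris_wide <- pivot_wider(tetris, names_from = time, values_from = value)
tetris_wide
```

The long table contains no NA anywhere, yet the wide table has one in Sam's evening cell.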

3. Explicit and implicit missing values

One way to think of missing values in a dataset is that they are either represented explicitly - they are recorded as missing with NA - or they are represented implicitly - they are not shown in the data, but implied.

4. Making implicit missings explicit

To make implicitly missing values explicit, we can use the complete function from tidyr on our Tetris dataset. Here, you take the data, then use the complete function, passing it the variables that you want it to find unique combinations of. In this case, we are interested in name and time. We can see now that this produces an "evening" time slot for Sam - with an NA value.
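As a rough sketch, the call could look like this. The score values are again illustrative assumptions; complete and the pipe come from tidyr.

```r
library(tibble)
library(tidyr)

# Illustrative scores only (assumed values); Sam has no "evening" row
tetris <- tribble(
  ~name,   ~time,       ~value,
  "Robin", "morning",   358,
  "Robin", "afternoon", 371,
  "Robin", "evening",   432,
  "Sam",   "morning",   139,
  "Sam",   "afternoon", 184,
  "Blair", "morning",   963,
  "Blair", "afternoon", 982,
  "Blair", "evening",   834
)

# Build every combination of name and time; combinations that were
# absent from the data get a new row with NA in the remaining columns
tetris_complete <- tetris %>%
  complete(name, time)

tetris_complete
```

With three names and three times, the result has nine rows, and the only NA is the value for Sam in the evening.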

5. Handling explicitly missing values

Sometimes missing data is entered to help make a dataset more readable. For example, imagine if the name column of our Tetris data was only filled in on the first row for each person: Robin, NA, NA, then Sam, NA, NA, then Blair, NA, NA. We know something about the data structure here, and what we want to do is fill these values down.

6. Handling explicitly missing values

Filling these values down is one of those things that looks simple, but it can be somewhat difficult to program ourselves. Luckily, we can use the fill function from tidyr. To use it, we pass it the column names that we want to fill down - in our case, name. The missing values will then be replaced by the previous present value. This method of filling in missing values is referred to as "last observation carried forward" and is sometimes abbreviated as "locf".
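Here is a small sketch of filling down with fill. The data frame name tetris_gaps and the score values are made up for illustration; only the name column has gaps.

```r
library(tibble)
library(tidyr)

# Hypothetical data: each name is written once, then left as NA
# for the rest of that person's rows. Scores are assumed values.
tetris_gaps <- tibble(
  name  = c("Robin", NA, NA, "Sam", NA, NA, "Blair", NA, NA),
  time  = rep(c("morning", "afternoon", "evening"), times = 3),
  value = c(358, 371, 432, 139, 184, 202, 963, 982, 834)
)

# fill() carries the last non-missing name downwards (the default direction)
tetris_filled <- tetris_gaps %>%
  fill(name)

tetris_filled
```

Only the columns named in fill are touched, so any NA in time or value would be left alone.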

7. A Warning

Handling missing values in this way is useful in cases where there is some structure to the data that makes it easy to know what the missing value is. So be careful with this technique: it only solves a few missing data problems, and filling down a value that isn't implied by the structure of the data just invents data.

8. Let's practice!

Now it's your turn.