
Missing values

1. Missing values

Working with messy data means that sooner or later, you'll encounter missing values.

2. Missing values in R

In R, these values are shown as NA, which stands for Not Available. You saw an example of an NA value in a data frame in the previous exercise when we did not have a unit of measurement for whole oranges. While you sometimes just have to live with the presence of missing values, you want to have as few of them as possible since they can break some functions or cause unexpected results. To get rid of missing values, you can either overwrite them or remove observations that contain them.
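As a small illustration of how NA values can cause unexpected results, the base R sketch below uses a made-up vector (not the course data) and shows that a single NA makes sum() return NA unless missing values are explicitly skipped.

```r
# Hypothetical counts with one missing value (illustrative only)
counts <- c(4L, NA, 4L)

# is.na() flags which elements are missing
missing_flags <- is.na(counts)

# A single NA propagates through sum()
total_with_na <- sum(counts)

# na.rm = TRUE tells sum() to skip missing values
total_without_na <- sum(counts, na.rm = TRUE)

print(missing_flags)     # FALSE  TRUE FALSE
print(total_with_na)     # NA
print(total_without_na)  # 8
```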

3. Imputing with a default value: replace_na()

The process of overwriting missing values is called imputation, and the simplest way of doing it is to overwrite all NA values in a column with a default value. In the dataset shown here, we have the number of people that have walked on the moon over five years. In datasets with counted data such as this, NA values often imply zero values.

4. Imputing with a default value: replace_na()

We can impute this column with tidyr's replace_na() function. Besides the data, it takes a single argument: a list of column names set equal to the values we want to use for imputation. By using the 0L notation, we explicitly tell the function to impute with zero values of the integer data type. Without the L, the zero is a double. In older versions of tidyr, this changed the data type of the whole people_on_moon column to double, which is less memory efficient; recent versions of tidyr instead cast the replacement value to the column's existing type.
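A minimal sketch of this call is shown below. The data frame name moon_df and its values are assumptions made for illustration, since the slide's data is not reproduced in the transcript; only the people_on_moon column name comes from the lesson.

```r
library(tidyr)

# Hypothetical stand-in for the dataset on the slide (values are illustrative)
moon_df <- data.frame(
  year = 1969:1973,
  people_on_moon = c(4L, NA, 4L, 4L, NA)
)

# Replace every NA in people_on_moon with the integer zero (0L)
imputed <- replace_na(moon_df, list(people_on_moon = 0L))

print(imputed)
print(typeof(imputed$people_on_moon))  # "integer"
```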

5. Imputing with the most recent value: fill()

In most situations, you can't just use a single default value to impute a column. Let's assume that you tried to calculate the total number of people that had visited the moon at a certain point in time. Due to the missing values in the original column, you end up with something like this. You could impute this new column by using the last known value from top to bottom.

6. Imputing with the most recent value: fill()

This is exactly what tidyr's fill() function does by default. All we have to do is pass it the name of the column to impute.
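The sketch below shows the default behavior on a made-up running total. The names totals_df and total_people_on_moon, as well as the values, are hypothetical and only stand in for the column shown on the slide.

```r
library(tidyr)

# Hypothetical running total with gaps (values are illustrative)
totals_df <- data.frame(
  year = 1969:1973,
  total_people_on_moon = c(4L, NA, 8L, 12L, NA)
)

# By default, fill() carries the last known value downward
filled_down <- fill(totals_df, total_people_on_moon)

print(filled_down$total_people_on_moon)  # 4 4 8 12 12
```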

7. fill() imputation options

The direction of the imputation can be changed with the .direction argument. Here it is set to the default, "down", which gives the correct result.

8. fill() imputation options

When we set it to "up", we can see that values are now imputed upwards.
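To compare the two directions side by side, here is a short sketch on the same kind of made-up running total (totals_df and its values are assumptions, not the course data). Note that with "up", a missing value in the last row stays missing because there is no later value to borrow.

```r
library(tidyr)

# Hypothetical running total with gaps (values are illustrative)
totals_df <- data.frame(
  year = 1969:1973,
  total_people_on_moon = c(4L, NA, 8L, 12L, NA)
)

# Explicit default: carry the previous value downward
filled_down <- fill(totals_df, total_people_on_moon, .direction = "down")

# Carry the next value upward instead
filled_up <- fill(totals_df, total_people_on_moon, .direction = "up")

print(filled_down$total_people_on_moon)  # 4 4 8 12 12
print(filled_up$total_people_on_moon)    # 4 8 8 12 NA
```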

9. Removing rows with missing values: drop_na()

If we want to get rid of the rows with missing values, we can use the drop_na() function. By default, it will look at all variables and remove all observations that have at least one missing value.
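As a rough sketch, calling drop_na() without naming any columns on a made-up version of the moon data (moon_df and its values are assumptions) keeps only the rows that are complete in every column.

```r
library(tidyr)

# Hypothetical stand-in for the dataset on the slide (values are illustrative)
moon_df <- data.frame(
  year = 1969:1973,
  people_on_moon = c(4L, NA, 4L, 4L, NA)
)

# With no columns named, every column is checked for missing values
complete_rows <- drop_na(moon_df)

print(complete_rows$year)  # 1969 1971 1972
```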

10. drop_na() caveats

This is a bit of a brute-force approach, which can sometimes cause you to lose too much data. For example, let's add a column for the number of people on Mars in those years, which contains nothing but missing values.

11. drop_na() caveats

Calling the drop_na() function would make us lose all data.

12. drop_na() caveats

We can solve this by passing the drop_na() function the names of the columns in which we want it to search for missing values. It will then ignore all other columns.
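The caveat and its fix can be sketched as follows. The data frame name space_df, the column name people_on_mars, and all values are hypothetical; they only mimic the situation described above.

```r
library(tidyr)

# Hypothetical data: the Mars column contains nothing but missing values
space_df <- data.frame(
  year = 1969:1973,
  people_on_moon = c(4L, NA, 4L, 4L, NA),
  people_on_mars = NA_integer_
)

# Checking all columns removes every row, because people_on_mars is always NA
all_dropped <- drop_na(space_df)

# Checking only people_on_moon ignores the missing values in people_on_mars
moon_only <- drop_na(space_df, people_on_moon)

print(nrow(all_dropped))  # 0
print(nrow(moon_only))    # 3
```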

13. Let's practice!

Now it's your turn, let's practice!
