Completeness

1. Completeness

Welcome back! In this lesson, we'll talk about completeness and missing data.

2. What is missing data?

Missing data is one of the most common and most important data cleaning problems. Data is considered "missing" when there is no value stored for a variable in an observation. Missing data is most commonly represented as NA or NaN, but can take on arbitrary values like 0, 99, or a dot.
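As a rough sketch of how those stand-in values can be turned into proper NAs, here is one option that uses read.csv's na.strings argument. The small CSV text below is made up for illustration and is not part of the lesson's data, and it assumes that 99 and a dot are never legitimate measurements.

```r
# Hypothetical raw data: "." and 99 are used as stand-ins for missing ozone readings.
raw <- "day,ozone
1,41
2,.
3,99
4,12"

# Tell read.csv which strings should be read as NA.
# Caution: na.strings applies to every column, so 99 must never be a real value.
readings <- read.csv(text = raw, na.strings = c("NA", ".", "99"))

is.na(readings$ozone)
```

Treating 0 as missing is riskier, since zero is often a legitimate measurement, so it is left alone here.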

3. What is missing data?

Like a lot of the problems that we've seen so far, missing data can happen due to technical errors.

4. What is missing data?

It can also happen due to human errors. Missing data can take many forms, so let's take a look at an example.

5. Air quality

Let's take a look at the airquality dataset. It contains ozone, solar radiation, wind, and temperature measurements for different days of the year.

6. Air quality

Here, we have some missing values, which are represented as NA in R.

7. Finding missing values

We can find missing values in a dataset using the is-dot-na function, which returns TRUE if a value is missing and FALSE if it's not, for every value in a data frame.
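Here is a minimal sketch of that call. It assumes airquality refers to R's built-in datasets::airquality, which may not be identical to the copy shown on the slides.

```r
# Assumption: airquality is the data frame that ships with base R (datasets package).
# is.na() on a data frame returns a logical matrix of the same shape.
na_flags <- is.na(airquality)

# Look at the first few rows: TRUE marks a missing value.
head(na_flags)
```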

8. Counting missing values

If we wrap is-dot-na with sum, we get the total number of NAs in the entire dataset. However, this isn't separated by column, so we don't know exactly where our missing values are.
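A short sketch of the count, again assuming the built-in datasets::airquality. The colSums line is an addition that the lesson does not use; it is just one base R way to get a per-column breakdown.

```r
# Assumption: airquality is R's built-in copy, which may differ from the lesson's version.
# TRUE counts as 1, so summing the logical matrix gives the total number of NAs.
total_missing <- sum(is.na(airquality))
total_missing

# Not part of the lesson: summing each column separately shows where the NAs are.
missing_by_column <- colSums(is.na(airquality))
missing_by_column
```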

9. Visualizing missing values

That's where visualization comes in. We can use the vis_miss function from the visdat package. This will give us a plot that shows missing values in black, and present values in gray. We can look down a column of the plot to see how many missing values there are in each column, and can look from left to right to get a sense of whether there are a lot of rows with multiple missing values. Here, the Ozone column has the most missing values. Solar radiation also has some missing values, but not as many. None of the other columns have missing data. It looks like the missingness is pretty random, but let's take a deeper look.
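A sketch of the plotting call, assuming the visdat package is installed and that airquality is the built-in data frame. The requireNamespace guard is an addition so the snippet does nothing, rather than failing, when visdat is unavailable.

```r
# Assumption: visdat is installed; the guard below skips the plot otherwise.
if (requireNamespace("visdat", quietly = TRUE)) {
  # vis_miss() draws one column per variable and one row per observation.
  missing_plot <- visdat::vis_miss(airquality)
  print(missing_plot)
}
```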

10. Investigating missingness

Let's see if there are any differences between the rows with missing and non-missing ozone values. We'll create a new column called miss_ozone, using is-dot-na to determine if the row is missing the ozone value or not. Then, we group by miss_ozone and use summarize, taking the median of each variable. We also set na-dot-rm to TRUE so that the median function ignores any missing values. The first row has the median of each variable for all rows with non-missing ozone values, and the second row has the median of each variable for all rows with missing ozone. Most of the medians look quite similar, but there's about a thirty-degree difference in temperature! This suggests that Ozone is missing for days that had high temperatures.
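One way this comparison might be written with dplyr is sketched below. It assumes dplyr 1.0 or later (for across) and the built-in datasets::airquality, so the medians it produces may not match the numbers on the slide. The name medians_by_missingness is made up for illustration.

```r
library(dplyr)

# Assumption: airquality is R's built-in copy, not necessarily the lesson's version.
medians_by_missingness <- airquality %>%
  # Flag each row by whether its Ozone value is missing.
  mutate(miss_ozone = is.na(Ozone)) %>%
  group_by(miss_ozone) %>%
  # Median of every non-grouping column, ignoring NAs within each group.
  summarize(across(everything(), ~ median(.x, na.rm = TRUE)))

medians_by_missingness
```

The Ozone median for the miss_ozone = TRUE row comes out as NA, since every Ozone value in that group is missing.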

11. Investigating missingness

If we sort the observations by temperature, then use vis_miss, we can see that all the missing values of Ozone are clustered in the last observations, which are the observations with the highest temperatures. Something must break in the sensor when it gets too hot out!
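A sketch of the sort-then-plot step, assuming dplyr and visdat are installed. With the built-in datasets::airquality the clustering described here may not appear, since that copy may differ from the one used in the lesson.

```r
library(dplyr)

# Sort rows from the lowest to the highest temperature.
sorted_air <- airquality %>%
  arrange(Temp)

# Assumption: visdat is installed; skip the plot otherwise.
if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_miss(sorted_air))
}
```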

12. Types of missing data

Let's talk about the types of missing data. Data can be missing completely at random, missing at random, or missing not at random.

13. Types of missing data

When data is missing completely at random, there is no pattern to the missingness and no relationship between missing data and any other values. This could happen from something like data entry errors.

14. Types of missing data

When data is missing at random, there is a systematic relationship between missing data and other observed values in the dataset. This is just like what we observed with the air quality data where there was a relationship between missingness and temperature. "Missing at random" is actually a misleading name, since there's nothing random about this type of missing data.

15. Types of missingness

When data is missing not at random, there is a systematic relationship between the missing data and unobserved values. For example, when it's really hot outside, the thermometer might stop working, so we don't have temperature measurements for days with high temperatures. However, we have no way to tell this just from looking at the data since we can't actually see what the missing temperatures are.
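To make the three types more concrete, here is a small simulation in base R. Everything in it is made up for illustration: the temp and ozone vectors, the cutoff of 85 degrees, and the 20 randomly dropped values are all assumptions, not part of the lesson's data.

```r
set.seed(42)  # make the simulated data reproducible
n <- 100

# Hypothetical "true" measurements.
temp  <- round(rnorm(n, mean = 78, sd = 9))
ozone <- round(pmax(1, 2 * (temp - 55) + rnorm(n, sd = 10)))

# Missing completely at random: 20 rows picked purely by chance lose their ozone value.
ozone_mcar <- ozone
ozone_mcar[sample(n, 20)] <- NA

# Missing at random: ozone is missing whenever the observed temperature is above 85.
ozone_mar <- ifelse(temp > 85, NA, ozone)

# Missing not at random: temperature is missing whenever its own true value is above 85.
temp_mnar <- ifelse(temp > 85, NA, temp)

# Compare the median temperature of rows with and without ozone under each mechanism.
tapply(temp, is.na(ozone_mcar), median)
tapply(temp, is.na(ozone_mar), median)

# Under MNAR, the observed temperatures never exceed the cutoff,
# and nothing in the remaining data reveals what the missing values were.
max(temp_mnar, na.rm = TRUE)
```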

16. Dealing with missing data

There are lots of ways to deal with missing data. We can remove any rows that contain missing data. We can also impute, or fill in, missing values using statistical measures or domain knowledge. There are also more complicated algorithmic approaches or ones that require some machine learning. Each missingness type requires a specific approach, and each type of approach has pros and cons. To learn more, check out these courses!

17. Dropping missing values

We can remove rows with missing values by filtering for rows where the Ozone value is not NA and the Solar radiation value is not NA.
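A minimal sketch of that filter with dplyr, assuming the built-in datasets::airquality, where the solar radiation column is named Solar.R. The name complete_air is made up for illustration.

```r
library(dplyr)

# Keep only rows where both Ozone and Solar.R are present.
complete_air <- airquality %>%
  filter(!is.na(Ozone), !is.na(Solar.R))

# Compare row counts before and after dropping.
nrow(airquality)
nrow(complete_air)
```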

18. Replacing missing values

We can replace missing values using mutate combined with ifelse. We create a new column called ozone_filled. If the Ozone value is missing, we use the mean Ozone value. If the Ozone value isn't missing, we use the original value.
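A sketch of the mean replacement, again assuming dplyr and the built-in datasets::airquality. The name air_filled is made up for illustration.

```r
library(dplyr)

air_filled <- airquality %>%
  # Use the mean of the observed Ozone values wherever Ozone is missing.
  mutate(ozone_filled = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

# The new column has no missing values; the original Ozone column is unchanged.
sum(is.na(air_filled$ozone_filled))
sum(is.na(air_filled$Ozone))
```

Keeping the filled values in a new column rather than overwriting Ozone makes it easy to check which values were imputed.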

19. Let's practice!

Now that you know how to tackle missing data, time to practice!
