Introduction to missing data

1. Introduction to missing data

Hi, my name is Nick Tierney, I'm a statistician and writer of the naniar package, which makes it easy to work with missing data, and I'll be your instructor for this course.

2. Introduction

Statistician Gertrude Mary Cox once said: "The best thing to do with missing data is to not have any". True as that is, it is not the world we live in. Working with real world data means working with missing data. To be a great analyst you need to know how to deal with missing values. Understanding how missing data works is important, as it can have unexpected effects on your analysis. For example, fitting a linear model on data with missing values drops, by default, every row that has a missing value in a model variable. This means your decisions aren't based on all of the evidence. Replacing missing values, which is called imputation, has to be done very carefully - inserting only the mean can lead to poor estimates and decisions.
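To make those two effects concrete, here is a minimal sketch. It assumes the built-in airquality dataset as a stand-in, which is not necessarily the data used in this course, and it only illustrates the idea rather than recommending an analysis.

```r
# Illustrative only: airquality ships with R and has missing values
# in Ozone and Solar.R. It is an assumed stand-in dataset.
n_rows <- nrow(airquality)

# By default lm() drops incomplete rows (the na.action option defaults to na.omit)
fit <- lm(Ozone ~ Solar.R, data = airquality)
n_used <- nobs(fit)

n_rows            # rows in the data
n_used            # rows the model was actually fitted on
n_rows - n_used   # rows dropped because of missing values

# Mean imputation: replace each missing Ozone value with the observed mean
ozone_imputed <- ifelse(is.na(airquality$Ozone),
                        mean(airquality$Ozone, na.rm = TRUE),
                        airquality$Ozone)

# The imputed variable is less spread out than the observed values
sd(airquality$Ozone, na.rm = TRUE)
sd(ozone_imputed)
```

The model is fitted on fewer rows than the data contains, and filling every gap with the mean shrinks the spread of the variable, which is one way mean imputation can distort estimates.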

3. What will you learn

In this course you will learn what missing values are, how to find missing data, how to wrangle and tidy missing data, why data is missing, and how to impute missing values.

4. Assumed knowledge

For this course I will assume you have basic to intermediate experience with R, experience creating plots using ggplot2, experience using dplyr to manipulate data, and experience fitting linear models in R. In this first chapter we introduce missing values and how to check for and count them.

5. What are missing values?

Before we get started, we need to define missing values. Missing values are values that should have been recorded, but were not. Think of it this way: You might accidentally not record seeing a bird - this is a missing value. This is different to recording that there were no birds observed. R stores missing values as NA, which stands for not available.

6. How do I check if I have missing values?

Missing values don't jump out and scream "I'm here!". They're usually hidden, like a needle in a haystack. To detect missing values use any_na, which returns TRUE if there are any missings, and FALSE if there are none. are_na asks "are these NA?" and returns TRUE or FALSE for each value. Here, are_na shows us 3 TRUE values - 3 missing values. To avoid counting each TRUE yourself, n_miss counts the number of missings. And prop_miss gives the proportion of missings, which gives important context: 50% of the data is missing!
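The four checks can be sketched as follows. This assumes the naniar package is installed, and the vector x is made up for illustration (six values, three of them missing) so that it matches the counts described above. It is not necessarily the vector shown on the slide.

```r
library(naniar)

# Hypothetical example vector: 6 values, 3 of them missing
x <- c(1, NA, 3, NA, NA, 6)

any_na(x)     # is there at least one missing value?   (base R: anyNA(x))
are_na(x)     # TRUE/FALSE for each element            (base R: is.na(x))
n_miss(x)     # number of missing values               (base R: sum(is.na(x)))
prop_miss(x)  # proportion of missing values           (base R: mean(is.na(x)))
```

The base R expressions in the comments give the same answers if naniar is not available.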

7. Working with missing data

So what happens when we mix missing values with our calculations? We need to know what happens, so we can be primed to find these cases. The general rule is this: Calculations with NA return NA. Say you have the height of three friends: Sophie, Dan, and Fred. The sum of their heights returns NA. This is because we don't know the sum of a number and NA.
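A small sketch of the heights example, with made-up values (the numbers are assumptions for illustration, and Fred's height is the missing one):

```r
# Hypothetical heights in cm; Fred's height was never recorded
heights <- c(Sophie = 165, Dan = 177, Fred = NA)

total <- sum(heights)
total                      # NA: a number plus an unknown value is unknown

# Many base R summaries accept na.rm = TRUE to drop NAs first.
# This sums only the recorded heights, which answers a different question.
total_known <- sum(heights, na.rm = TRUE)
total_known                # 342
```

Using na.rm = TRUE makes the calculation run, but it is worth remembering that the result then describes only the values that were recorded.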

8. Missing data gotchas

There are some "gotchas" to be aware of when working with missing data. For example, NaN is "Not a Number", and comes from operations like the square root of -1. R actually interprets NaN as a missing value. NULL is an empty value but is not missing. This is subtly different from missing: An empty bucket isn't missing water. Inf is an infinite value that results from expressions like 10 divided by 0, and it is not missing.
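These three cases can be checked directly in base R. The sketch below uses is.na() as the test for "missing"; the variable names are chosen only for illustration.

```r
# NaN: "Not a Number"; sqrt(-1) also emits a warning, silenced here
not_a_number <- suppressWarnings(sqrt(-1))
is.nan(not_a_number)   # TRUE
is.na(not_a_number)    # TRUE: R treats NaN as missing

# NULL: empty, not missing. There is nothing for is.na() to test,
# so it returns a zero-length logical vector.
is.null(NULL)          # TRUE
length(is.na(NULL))    # 0

# Inf: an infinite value, not missing
infinite <- 10 / 0
is.infinite(infinite)  # TRUE
is.na(infinite)        # FALSE
```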

9. Missing data gotchas (2)

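Finally, beware of conditional statements with missings. For example, NA or TRUE is TRUE, because the result is TRUE whatever the missing value turns out to be. NA or FALSE is NA, because the result depends on the missing value. The order of arithmetic can matter too: NA + NaN is NA, but NaN + NA is NaN.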
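A short sketch of these cases in base R. The logical results follow from R's three-valued logic. For the arithmetic, the R documentation notes that whether mixing NA and NaN gives NA or NaN is not guaranteed and may depend on the platform, so the comments report what this transcript describes rather than a fixed rule.

```r
# Logical "or": the answer is known only if it does not depend on the NA
or_true  <- NA | TRUE    # TRUE
or_false <- NA | FALSE   # NA

# Logical "and" behaves the mirror-image way
and_false <- NA & FALSE  # FALSE
and_true  <- NA & TRUE   # NA

# Arithmetic mixing NA and NaN: both results count as missing,
# but which of the two you get can depend on operand order and platform
na_first  <- NA + NaN    # described above as NA
nan_first <- NaN + NA    # described above as NaN
is.na(na_first)          # TRUE
is.na(nan_first)         # TRUE
```

Because of this, is.na() is a safer check than comparing a result against NA or NaN directly.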

10. Let's practice!

Let's practice!