
Missing data mechanisms

1. Missing data mechanisms

Welcome back! In this lesson, we will discuss an important topic to be considered before handling incomplete data: the missing data mechanisms.

2. Missing Data Mechanisms: overview

Missing data problems can be classified into three categories. Distinguishing between them is vital because each category requires a different solution. There are three so-called missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing not at Random (MNAR). Let's discuss them one by one.

3. Missing Completely at Random (MCAR)

The data are MCAR when the locations of missing values in the dataset are purely random and do not depend on any other data. For example, imagine a weather sensor is measuring temperature and sending the data to a database. There are some missing entries in the database for when the sensor broke down, which happens randomly.

4. Missing at Random (MAR)

The data are MAR when the locations of missing values in the dataset depend on some other, observed data. For instance, say there are some missing temperature values in the database for when the sensor was switched off for maintenance. As the maintenance team never work on the weekends, the locations of missing values depend on the day of the week.

5. Missing not at Random (MNAR)

The data are MNAR when the locations of missing values in the dataset depend on the missing values themselves. Continuing our weather sensor story, imagine that when it's extremely cold, the sensor freezes and stops working. So, it does not record very low temperatures. Thus, the locations of missing values in the temperature variable depend on the values of this variable themselves.
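To make the three mechanisms concrete, here is a minimal simulation sketch of the weather sensor story. Everything in it is an assumption made for illustration: the temperature distribution, the 10% and 15% failure rates, the -5 degree freezing threshold, and all variable names are hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000

# Hypothetical sensor data: a day of the week and a temperature reading
day <- sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), n, replace = TRUE)
is_weekday <- !(day %in% c("Sat", "Sun"))
temperature <- rnorm(n, mean = 10, sd = 8)

# MCAR: every reading has the same 10% chance of being lost
temp_mcar <- replace(temperature, runif(n) < 0.10, NA)

# MAR: readings are lost only on weekdays (maintenance), never on weekends,
# so missingness depends on an observed variable (the day)
temp_mar <- replace(temperature, is_weekday & runif(n) < 0.15, NA)

# MNAR: readings are lost when the true temperature is very low,
# so missingness depends on the unobserved value itself
temp_mnar <- replace(temperature, temperature < -5, NA)

# Share of missing values on weekdays vs. weekends under MAR
tapply(is.na(temp_mar), is_weekday, mean)
```

Under MAR the share of missing readings differs between weekdays and weekends, while under MNAR the lowest readings never appear among the observed values.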

6. Handling the mechanisms

So how do we address these different missingness mechanisms? One way is to drop incomplete observations. What happens then? If the data are MCAR, the only consequence is a loss of information, because in discarding entire rows we also remove observed data. If the data are MAR or MNAR, however, removing incomplete rows introduces bias into models built on these data. In this case, missing values should be imputed. As many imputation methods assume MAR, it is very important to be able to detect it!
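The effect of dropping incomplete observations can be sketched with a small simulation. It compares the mean of the remaining values under MCAR and under MNAR (used here as the clearest case of bias) with the mean of the full data. The distribution, the 20% drop rate, and the -2 degree threshold are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 10000
temperature <- rnorm(n, mean = 10, sd = 8)  # hypothetical complete data

# MCAR: a random 20% of readings are lost
temp_mcar <- replace(temperature, runif(n) < 0.20, NA)

# MNAR: the sensor fails whenever the true temperature is below -2
temp_mnar <- replace(temperature, temperature < -2, NA)

# "Dropping incomplete observations" here means ignoring the NA entries
true_mean <- mean(temperature)
mean_mcar <- mean(temp_mcar, na.rm = TRUE)
mean_mnar <- mean(temp_mnar, na.rm = TRUE)

c(true = true_mean, mcar = mean_mcar, mnar = mean_mnar)
```

In this setup the MCAR mean stays close to the true mean, since we only lose information, while the MNAR mean is pushed upwards because the coldest readings are exactly the ones that were removed.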

7. Statistical testing

One way to detect MAR is with statistical testing. Let's discuss testing using the example of a t-test, which tests whether the means of two vectors are different. First, we make an assumption about the data, called the null hypothesis: in this case, that the means are equal. Then, we compute the test statistic from the data, and then the p-value, which tells us how likely it is to obtain a test statistic at least as extreme as the one we got, assuming the null hypothesis is true. A small p-value means this was unlikely, so we reject the null and conclude that the means are different. A large p-value, on the other hand, means we cannot reject the null: the data give us no evidence that the means differ.
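As a quick illustration of these steps, the sketch below runs a t-test on two simulated vectors whose true means differ. The vectors, their sizes, and the 0.05 threshold are assumptions for this example only.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Two hypothetical samples with different true means
a <- rnorm(200, mean = 10, sd = 3)
b <- rnorm(200, mean = 12, sd = 3)

# t.test() runs a two-sample t-test; by default it does not assume equal variances
result <- t.test(a, b)

result$statistic  # the test statistic
result$p.value    # the p-value under the null hypothesis of equal means

# Decision rule with the conventional 0.05 threshold
reject_null <- result$p.value < 0.05
reject_null
```

Because the two samples were generated with clearly different means, we would expect the p-value to be small here and the null hypothesis to be rejected.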

8. Testing for MAR

To test for MAR, we can use the t-test to check if the percentage of missing values in one variable differs for different values of another variable. For instance, let's use the NHANES dataset from the previous lesson to determine whether the percentage of missing values in `PhysActive` is different for males and females. To do this, we first create a dummy variable denoting whether `PhysActive` is missing. Then, we can use the t-test to check if the means of this dummy for males and females are different. If the p-value is small (e.g. < 0.05), we conclude that the means are different: missingness in `PhysActive` depends on gender, so the data are MAR.

9. Testing in practice

Let's see it working in practice. First, we create a dummy variable that is TRUE if `PhysActive` is missing and FALSE otherwise. Then, we pull it into two vectors: one for males and one for females.
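These two steps might look roughly like the base R sketch below. It does not use the real NHANES data: `nhanes_like` is a small simulated stand-in, and the `Gender` column name, its `"male"`/`"female"` labels, and the missingness rate are assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
n <- 500

# Hypothetical stand-in for the NHANES data, with some PhysActive values missing
nhanes_like <- data.frame(
  Gender = sample(c("male", "female"), n, replace = TRUE),
  PhysActive = sample(c("Yes", "No", NA), n, replace = TRUE,
                      prob = c(0.45, 0.35, 0.20))
)

# Dummy variable: TRUE if PhysActive is missing, FALSE otherwise
nhanes_like$missing_phys_active <- is.na(nhanes_like$PhysActive)

# Pull the dummy into two vectors, one per gender
missing_male <- nhanes_like$missing_phys_active[nhanes_like$Gender == "male"]
missing_female <- nhanes_like$missing_phys_active[nhanes_like$Gender == "female"]

# The mean of a TRUE/FALSE vector is the share of TRUE values,
# that is, the proportion of missing entries in each group
mean(missing_male)
mean(missing_female)
```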

10. Interpreting test results

To perform the t-test on the two vectors, we can use R's built-in `t.test` function, which calculates the p-value for us. Here, it amounts to 0.085, which is above the 0.05 threshold, so we cannot reject the null hypothesis: we find no evidence that the percentage of missing values in `PhysActive` differs between genders. For these specific variables, then, the test does not indicate that the data are MAR.
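A minimal sketch of the call and its interpretation follows. The two vectors are simulated here rather than taken from NHANES, so the resulting p-value will not match the 0.085 quoted above. The vector names, group sizes, and 20% missingness rate are assumptions.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical missingness dummies for two groups with the same true rate
missing_male <- runif(400) < 0.20
missing_female <- runif(400) < 0.20

# Convert TRUE/FALSE to 1/0 so the group means are proportions of missing values
test_result <- t.test(as.numeric(missing_male), as.numeric(missing_female))

p_value <- test_result$p.value
p_value

# A small p-value would point to missingness depending on the grouping variable
evidence_of_dependence <- p_value < 0.05
evidence_of_dependence
```

The threshold of 0.05 is a convention rather than a rule, and a p-value above it only means the test found no evidence of a difference.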

11. Let's practice recognizing missing data mechanisms!

Now that you know about missing data mechanisms, let's put your knowledge to the test.
