Is the data missing at random?
1. Is the data missing at random?
Welcome back to the course! Is data missing at random, or does it follow a pattern? It turns out that missingness often has a pattern, and usually for a good reason. Let's explore some of the possible reasons for missing data.
2. Possible reasons for missing data
One obvious reason is that data is simply missing at random. Another is that the missingness depends on another variable, or on the missingness of the same variable or of other variables. These are a few of the reasons why data might be missing.
3. Types of missingness
You can group missingness patterns into three broad categories: Missing Completely at Random, Missing at Random, and Missing Not at Random. Identifying the missingness type helps narrow down the methods you can use to treat missing data. Let's look at each one in detail with an example.
4. Missing Completely at Random (MCAR)
Missing Completely at Random implies that missingness has no relationship with any values, observed or missing. What does this mean? Suppose you have a class of students. A few students are absent on any given day, each for their own unrelated reasons. This is missing completely at random. Let's analyze the same idea with the diabetes data, which contains lab test observations of patients.
5. MCAR - An example
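One way to check for the kind of correlation discussed in this example, without a plot, is to compare missingness indicators directly. The sketch below is only an illustration: the small `diabetes` frame is made-up stand-in data, not the course dataset, and only the column names come from the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data; the values are made up.
diabetes = pd.DataFrame({
    "Glucose":       [148.0, np.nan, 183.0, 89.0, 137.0, 116.0],
    "Diastolic_BP":  [72.0, 66.0, np.nan, 66.0, np.nan, 74.0],
    "Skin_Fold":     [35.0, 29.0, np.nan, 23.0, np.nan, np.nan],
    "Serum_Insulin": [94.0, 168.0, np.nan, 88.0, np.nan, np.nan],
    "BMI":           [33.6, 26.6, 23.3, np.nan, 43.1, 25.6],
})

# True where a value is missing, False where it is observed.
is_missing = diabetes.isna()

# Number of missing values per column.
missing_counts = is_missing.sum()

# Correlation between the missingness indicators of each pair of columns.
# Values near 0 suggest unrelated missingness; values near 1 suggest the
# two columns tend to be missing in the same rows.
nullity_corr = is_missing.astype(int).corr()

print(missing_counts)
print(nullity_corr.round(2))
```

A column whose indicator shows little correlation with every other column is at least consistent with MCAR, although a low correlation alone does not prove it.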
In the missingness matrix plot, you can observe that the missing values in the 'Glucose' column are scattered at random, with no clear pattern and no correlation with other columns. Correlation here means the dependency of missing values on another variable being present or absent. This is missing completely at random, or MCAR. The column 'BMI' can also be categorized as MCAR since only a few values are missing, even though they appear to be slightly correlated with 'Diastolic_BP'.
6. Missing at Random (MAR)
The next category is Missing at Random. By definition, "there is a systematic relationship between missingness and other observed data, but not the missing data". Consider attendance in a classroom during winter, when many students are absent due to bad weather. Although the absences might look random, a hidden cause might be that students sitting close together have contracted a fever. Missing at random means that a relationship with another variable might exist. Here, attendance is slightly correlated with the season of the year. It's important to note that for MAR, missingness depends only on the observed values and not on the missing values.
7. MAR - An example
From the missingness plot, you can see that there are many missing values in the column 'Diastolic_BP'. This is a typical case of MAR, as there might be a reason for missingness that cannot be directly observed.
8. Missing Not at Random (MNAR)
Lastly, there is Missing Not at Random. By definition, there is a relationship between missingness and the values themselves. For instance, in our class of students, it is Sally's birthday. Sally and many of her friends are absent to attend her birthday party. This is not at all random, as only Sally and her friends are absent.
9. MNAR - An example
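Sorting on a column and checking how often two columns are missing together can be sketched as follows. The rows are hypothetical and only the column names come from the transcript, so treat this as an illustration rather than the course's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the example.
diabetes = pd.DataFrame({
    "Skin_Fold":     [35.0, np.nan, 23.0, np.nan, 29.0, np.nan],
    "Serum_Insulin": [94.0, np.nan, 88.0, np.nan, 168.0, np.nan],
    "BMI":           [33.6, 23.3, 28.1, 43.1, 26.6, 25.6],
})

# sort_values places NaN last by default, so rows with a missing
# 'Skin_Fold' are grouped at the bottom of the sorted frame.
sorted_df = diabetes.sort_values("Skin_Fold")

# Among rows where 'Skin_Fold' is missing, the share that also
# lacks 'Serum_Insulin'.
skin_missing = diabetes["Skin_Fold"].isna()
share_both = diabetes.loc[skin_missing, "Serum_Insulin"].isna().mean()

print(sorted_df)
print(share_both)
```

In this made-up frame, every row without 'Skin_Fold' also lacks 'Serum_Insulin', which is the kind of joint pattern that shows up once the frame is sorted.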
In the missingness summary of the diabetes data, there is a strong correlation between the missingness of 'Skin_Fold' and 'Serum_Insulin', which is easy to see after sorting the DataFrame on 'Skin_Fold'.
10. Summary
So, what have we learned so far? We have understood the possible reasons for missingness. You can categorize them into Missing Completely at Random, Missing at Random, or Missing Not at Random. We explored how to detect missing value patterns by performing operations on the variables and then visualizing them. We now know how to map and justify the missingness pattern.
11. Let's practice!
Now, let's practice!
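As a warm-up, the three categories can be simulated on a made-up class of students. Everything in this sketch is an assumption for illustration: the `far` flag, the score distribution, and the missingness probabilities are invented, and the three rules are just one simple way to mimic MCAR, MAR, and MNAR.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical class: 'far' (lives far from school) is always observed,
# 'score' is the value that may go missing.
students = [
    {"far": random.random() < 0.5, "score": random.gauss(70, 10)}
    for _ in range(1000)
]

def mean(values):
    return sum(values) / len(values)

# MCAR: every student has the same chance of a missing score.
mcar_mask = [random.random() < 0.2 for _ in students]

# MAR: the chance depends only on the observed variable 'far'.
mar_mask = [random.random() < (0.4 if s["far"] else 0.05) for s in students]

# MNAR: missingness depends on the score itself (low scores go missing).
mnar_mask = [s["score"] < 60 for s in students]

true_mean = mean([s["score"] for s in students])
mnar_scores = [s["score"] for s, m in zip(students, mnar_mask) if not m]

# Missing rate within each observed group under the MAR rule.
far_rate = mean([m for s, m in zip(students, mar_mask) if s["far"]])
near_rate = mean([m for s, m in zip(students, mar_mask) if not s["far"]])

print("MCAR missing rate:", mean(mcar_mask))
print("MAR missing rate (far, near):", far_rate, near_rate)
print("True mean vs. mean of observed scores under MNAR:",
      true_mean, mean(mnar_scores))
```

Under the MNAR rule, the mean of the observed scores is pulled upward because the low scores are exactly the ones that go missing. Under the MAR rule, the missing rate differs between the two observed groups.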