Finding patterns in missing data

1. Finding patterns in missing data

Continuing from the previous exercise, you'll now analyze the relationships that may exist between the missing values of different columns.

2. Finding correlations between missingness

The two quickest ways to analyze how missingness in one column relates to missingness in another are heatmaps (also called correlation maps) and dendrograms. This lesson covers both plots in detail.

3. Missingness Heatmap

The missingness heatmap seen here describes the correlation of missingness between columns. In simple terms, columns whose missing values tend to occur in the same rows are highly correlated, and columns whose missing values rarely co-occur are not.
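To make "correlation of missingness" concrete, here is a minimal sketch that computes it by hand with pandas. The DataFrame and its column names 'A', 'B' and 'C' are made up for illustration and are not the lesson's data. The idea is to turn each column into a 0/1 "is missing" indicator and correlate those indicators.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration.
# 'A' and 'B' are missing in exactly the same rows (1 and 3);
# 'C' is missing in exactly the rows where 'A' is present.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "B": [9.0, np.nan, 7.0, np.nan, 5.0, 4.0],
    "C": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = df.isnull().astype(int)

# Pearson correlation between the missingness indicators.
nullity_corr = nullity.corr()
print(nullity_corr)
```

Here 'A' and 'B' get a correlation of 1 because their missing values always co-occur, while 'A' and 'C' get -1 because one is missing exactly when the other is present.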

4. Missingness Heatmap

Let's understand this in detail using the 'diabetes' data. We'll use the 'msno.heatmap()' function of the 'missingno' package to generate the heatmap of the diabetes data. In the graph, the redder the color, the lower the correlation between the missing values of the columns; the bluer it is, the higher the correlation of missingness. Observe that the columns 'Skin Fold' and 'Serum Insulin' have the highest missingness correlation, while 'Skin Fold' and 'BMI' have the lowest. Now let's move on to creating a dendrogram.

5. Missingness Dendrogram

A dendrogram, simply put, is a tree diagram that groups similar objects in close branches. The missingness dendrogram describes the correlations in missingness by grouping similarly missing columns together. Let's visualize the graph using 'msno.dendrogram()' on the 'diabetes' data.

6. Missingness Dendrogram

To interpret this graph, read it from a top-down perspective.

7. Missingness Dendrogram

Cluster leaves that are linked together at a distance of zero fully predict one another's presence: one variable might always be empty while another is filled, or they might always both be filled or both empty, and so on.
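To see what a distance of zero means, here is a minimal sketch with made-up columns 'A', 'B' and 'C'. It assumes the distance is measured between the columns' 0/1 missingness indicators, and it uses plain Euclidean distance as a stand-in for whatever the plot computes internally. It therefore illustrates only the "always both filled or both empty" case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [7.0, np.nan, 5.0, np.nan],   # missing in the same rows as 'A'
    "C": [np.nan, 2.0, 3.0, 4.0],      # missing in a different row
})

# 1 where a value is missing, 0 where it is present.
nullity = df.isnull().astype(int)

# Euclidean distance between missingness indicators (an assumed metric).
dist_ab = np.linalg.norm(nullity["A"] - nullity["B"])
dist_ac = np.linalg.norm(nullity["A"] - nullity["C"])

print(dist_ab, dist_ac)
```

'A' and 'B' are at distance zero because knowing whether one is missing tells you exactly whether the other is, whereas 'A' and 'C' are further apart.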

8. Missingness Dendrogram

In this graph specifically, 'Skin Fold' and 'Serum Insulin' are highly correlated, which is also clear from the heatmap, and their missingness can be classified as Missing Not at Random, or MNAR.

9. Missingness Dendrogram

The missingness of 'Glucose' appears to be more similar to that of 'BMI' than to that of 'Diastolic_BP'. However, checking the matrix plot from the previous lesson and the number of missing values shows that this correlation is high only because both 'BMI' and 'Glucose' have very few missing values. Hence, 'Glucose' can just as well be considered Missing Completely at Random, or MCAR.
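A quick way to run this kind of sanity check is to count the missing values per column before trusting a high correlation. The sketch below uses a tiny made-up DataFrame, not the real diabetes data, in which 'Glucose' and 'BMI' each have a single missing value that happens to fall in the same row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration.
n = 10
df = pd.DataFrame({
    "Glucose": np.arange(n, dtype=float),
    "BMI": np.arange(n, dtype=float),
    "Diastolic_BP": np.arange(n, dtype=float),
})
df.loc[3, ["Glucose", "BMI"]] = np.nan      # one shared missing row
df.loc[[1, 5, 7], "Diastolic_BP"] = np.nan  # three unrelated missing rows

# Number of missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Correlation between the missingness of 'Glucose' and 'BMI'.
glucose_bmi_corr = df.isnull().astype(int).corr().loc["Glucose", "BMI"]
print(glucose_bmi_corr)
```

The correlation comes out as perfect, yet it rests on a single row, which is why a high value based on very few missing entries should not be read as evidence of a real relationship.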

10. Summary

In this lesson, you learned to create and analyze the missingness heatmap and the missingness dendrogram.

11. Let's practice!

Now, it's time to practice!