Visualizing missingness across a variable

1. Visualizing missingness across a variable

Previously you learned how to identify patterns between missing variables. In this lesson, you'll move a step further to graphically analyze the relationship between missing values and non-missing values.

2. Missingness across a variable

We'll start by visualizing the missingness of one variable against another variable. We can use the 'diabetes' dataset to understand this more clearly. The scatterplot of 'Serum_Insulin' and 'BMI' shows the non-missing values in purple and the missing values in red.

3. Missingness across a variable

The red points along the y-axis are the missing values of 'Serum_Insulin' plotted against their 'BMI' values.

4. Missingness across a variable

Likewise, the points along the x-axis are the missing values of 'BMI' against their 'Serum_Insulin' values.

5. Missingness across a variable

The bottom-left corner represents the missing values of both 'BMI' and 'Serum_Insulin'.

6. Missingness across a variable

Looking at this graph, we can see that the missing values of 'Serum_Insulin' are spread across the whole range of 'BMI' values. Thus, we do not observe any specific correlation between the missingness of 'Serum_Insulin' and the values of 'BMI'. To create this graph, we will use the matplotlib library. However, matplotlib skips all missing values while plotting. Therefore, we first need to create a function that fills in dummy values for all the missing values in the DataFrame before plotting.

7. Filling dummy values

To generate dummy values, we can use the 'rand()' function from 'numpy.random'. We first store the number of missing values in 'BMI' in 'num_nulls' and then generate an array of random dummy values of size 'num_nulls'. The generated dummy values appear on the graph as shown. The 'rand()' function always outputs values between 0 and 1. However, the values of both 'BMI' and 'Skin Fold' span a much wider range than 0 to 1. Hence, we'll need to scale and shift the generated dummy values so that they fit nicely into the graph.
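
The step just described can be sketched as follows. This is a minimal illustration, not the lesson's exact code: the 'bmi' Series is a small hypothetical stand-in for the 'BMI' column of the 'diabetes' dataset.

```python
import numpy as np
import pandas as pd
from numpy.random import rand

# Hypothetical stand-in for the 'BMI' column (values are made up for illustration)
bmi = pd.Series([33.6, np.nan, 23.3, 28.1, np.nan, 25.6], name="BMI")

# Count the missing values in the column
num_nulls = bmi.isnull().sum()

# Draw one random dummy value per missing entry; rand() samples from [0, 1)
dummy_values = rand(num_nulls)
```

At this point every dummy value lies between 0 and 1, regardless of the scale of the column itself.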

8. Filling dummy values

We can shift the dummy values from the range 0 to 1 to the range -2 to -1 by subtracting 2. By doing this, we make sure that the dummy values are always lower than the actual values, as can be observed from the graph.

9. Filling dummy values

We then scale the dummy values by 0.075 times the range of 'BMI'. The number 0.075 was chosen after experimenting with various values between 0 and 1. Observe how the dummy values of 'BMI' are still some distance away from the actual values!

10. Filling dummy values

Hence, we shift these values by adding the minimum value of the column, 'BMI.min()'. This makes sure that the dummy values sit just below the actual values.
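
The three transformations (subtract 2, scale by 0.075 times the range, add the minimum) can be traced step by step in a short sketch. The 'bmi' Series and the intermediate name 'bmi_range' are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from numpy.random import rand

# Hypothetical stand-in for the 'BMI' column (values are made up for illustration)
bmi = pd.Series([33.6, np.nan, 23.3, 28.1, np.nan, 25.6], name="BMI")
num_nulls = bmi.isnull().sum()

# Step 1: random values in [0, 1)
dummy_values = rand(num_nulls)

# Step 2: subtract 2, so the values lie in [-2, -1)
dummy_values = dummy_values - 2

# Step 3: scale by 0.075 times the range of the column
bmi_range = bmi.max() - bmi.min()
dummy_values = dummy_values * 0.075 * bmi_range

# Step 4: add the column minimum, so the values sit just below the real data
dummy_values = dummy_values + bmi.min()
```

After the last step, the dummy values lie between 'bmi.min() - 0.15 * bmi_range' and 'bmi.min() - 0.075 * bmi_range', which places them in a narrow band just below the smallest observed value.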

11. Function to fill dummy values

We will create a function 'fill_dummy_values' that fills in dummy values for all columns in the DataFrame. We use a for loop to produce dummy values for each column of a given DataFrame. We can also pass the scaling factor as an argument so that we can resize the range of the dummy values. In addition to the previous steps of scaling and shifting the dummy values, we first have to create a copy of the DataFrame, so that the dummy values are filled into the copy and not into the original. Let's now use this function to create our scatterplot.
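
One way such a function might look is sketched below. The function name and the idea of a scaling-factor argument come from the lesson; the default of 0.075, the guard for columns without missing values, the use of '.loc' for assignment, and the small example DataFrame are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from numpy.random import rand

def fill_dummy_values(df, scaling_factor=0.075):
    """Return a copy of df with missing values replaced by dummy values."""
    # Work on a copy so the original DataFrame keeps its missing values
    df_dummy = df.copy(deep=True)
    for col_name in df_dummy.columns:
        col = df_dummy[col_name]
        col_null = col.isnull()
        num_nulls = col_null.sum()
        if num_nulls == 0:
            continue  # assumption: leave columns without missing values untouched
        # Range of the observed (non-missing) values
        col_range = col.max() - col.min()
        # Shift to [-2, -1), scale by the column range, then move below the minimum
        dummy_values = (rand(num_nulls) - 2) * scaling_factor * col_range + col.min()
        df_dummy.loc[col_null, col_name] = dummy_values
    return df_dummy

# Hypothetical miniature of the 'diabetes' dataset (values are made up)
diabetes = pd.DataFrame({
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 88.0, 543.0],
    "BMI": [33.6, 26.6, np.nan, 28.1, 43.1, 25.6],
})
diabetes_dummy = fill_dummy_values(diabetes)
```

Because the function works on a deep copy, 'diabetes' still contains its missing values afterwards, which is needed later to compute the nullity.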

12. Generate scatterplot of missing values

We store the DataFrame with the dummy values filled in as 'diabetes_dummy', using the function 'fill_dummy_values'. The graph can be plotted with 'diabetes_dummy.plot()', setting 'x="Serum_Insulin"', 'y="BMI"', 'kind="scatter"', and 'alpha=0.5' for transparency. The object 'nullity' is the element-wise sum (logical OR) of the nullities of 'Serum_Insulin' and 'BMI'. It is a series of True and False values, where True means that at least one of the two values is missing and False means that neither is missing. The nullity can be used to set the color of the data points, with 'cmap="rainbow"' as the color map. Thus, we obtain the graph that we require.
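
Putting the pieces together, the plotting step might look like the sketch below. It repeats a compact version of 'fill_dummy_values' so that it runs on its own. The example data is hypothetical, passing 'nullity' through the 'c' argument is an assumption about how the color is wired up, and the 'Agg' backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import numpy as np
import pandas as pd
from numpy.random import rand

def fill_dummy_values(df, scaling_factor=0.075):
    # Compact sketch: replace missing values with dummies just below each column's minimum
    df_dummy = df.copy(deep=True)
    for col_name in df_dummy.columns:
        col_null = df_dummy[col_name].isnull()
        num_nulls = col_null.sum()
        if num_nulls == 0:
            continue
        col = df_dummy[col_name]
        col_range = col.max() - col.min()
        df_dummy.loc[col_null, col_name] = (
            (rand(num_nulls) - 2) * scaling_factor * col_range + col.min()
        )
    return df_dummy

# Hypothetical miniature of the 'diabetes' dataset (values are made up)
diabetes = pd.DataFrame({
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 88.0, 543.0],
    "BMI": [33.6, 26.6, np.nan, 28.1, 43.1, 25.6],
})

diabetes_dummy = fill_dummy_values(diabetes)

# True where at least one of the two columns is missing in the ORIGINAL data
nullity = diabetes["Serum_Insulin"].isnull() | diabetes["BMI"].isnull()

# Scatterplot of the filled data, colored by nullity
ax = diabetes_dummy.plot(
    x="Serum_Insulin", y="BMI", kind="scatter",
    alpha=0.5, c=nullity, cmap="rainbow",
)
```

Note that 'nullity' is computed from the original 'diabetes' DataFrame, since 'diabetes_dummy' no longer contains any missing values.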

13. Let's practice!

It's now time for you to solidify these concepts!