1. When and how to delete missing data
In the previous lessons, you learned how to analyze the missingness of variables in detail. Now, you'll decide how to act appropriately on the missing values: whether to keep them, impute them, or simply delete them. You'll also learn the types of deletion available, as well as when deletion is and is not appropriate.
2. Types of deletions
The two types of deletion in common use are pairwise deletion and listwise deletion.
Note that both types should be used only when the values are missing completely at random, that is, MCAR.
In pairwise deletion, only the missing values are skipped during calculations, whereas in listwise deletion, the complete row is deleted.
3. Pairwise Deletion
To illustrate, in pairwise deletion, you simply skip the missing values during an operation such as summing a column or finding its mean.
For instance, diabetes['Glucose'].mean() skips all the missing values while computing. To confirm this, we can divide the sum of the column by its count, which is 763 non-missing values; this returns the same value, 121.687.
Likewise, most aggregation operations in pandas skip missing values by default, which is equivalent to pairwise deletion.
Pairwise deletion minimizes the amount of data lost and is hence preferred. However, in some cases it can negatively affect our analysis.
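The check described above can be sketched as follows. This is a minimal illustration on a made-up toy column, not the actual diabetes data, so the numbers differ from the 763 and 121.687 quoted in the lesson; the name 'toy' is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the diabetes data (values are made up)
toy = pd.DataFrame({"Glucose": [148.0, 85.0, np.nan, 89.0, np.nan]})
glucose = toy["Glucose"]

n_observed = glucose.count()   # counts only the non-missing values
total = glucose.sum()          # missing values are skipped by default
manual_mean = total / n_observed
pandas_mean = glucose.mean()   # skipna=True is the default

# Turning the skipping off shows what happens without pairwise deletion
strict_mean = glucose.mean(skipna=False)

print(n_observed, manual_mean, pandas_mean, strict_mean)
```

With skipna=False, a single missing value makes the result missing as well, which is why the default behaviour amounts to pairwise deletion.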
4. Listwise Deletion or Complete Case
In listwise deletion, the whole row is dropped, as shown in the above figure. Because only rows with no missing values remain, it is also called complete case analysis.
To achieve this, we can use '.dropna()' and set 'how="any"' to delete a row even if only one of its values is missing. Setting 'subset=["Glucose"]' restricts the check to the Glucose column, so only the rows where Glucose is missing are deleted.
The major disadvantage of listwise deletion is the amount of data lost. Hence, it is recommended only when the number of missing values is very small.
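To make the difference between the two settings concrete, here is a small sketch on a hypothetical three-column toy frame (the values and the name 'toy' are assumptions for illustration). It compares how many rows each call removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Age": [50, 31, 32, 21],
})

# Drop a row if ANY of its values is missing (complete case analysis)
complete_cases = toy.dropna(how="any")

# Drop a row only if Glucose is missing; missing BMI values are kept
glucose_only = toy.dropna(subset=["Glucose"], how="any")

rows_lost_any = len(toy) - len(complete_cases)
rows_lost_subset = len(toy) - len(glucose_only)
print(rows_lost_any, rows_lost_subset)
```

Restricting the check with 'subset' loses fewer rows here, which illustrates why data loss grows quickly when every column is checked.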
5. Deletion in diabetes DataFrame
To get a clearer picture, we can start by plotting a missingness matrix of the 'diabetes' data.
Looking at the Glucose column, only 5 values are missing. With so few missing values, we can reasonably treat them as missing completely at random, or MCAR, and go ahead with listwise deletion.
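Alongside the matrix plot, the number of missing values per column can also be read off numerically. The sketch below uses a hypothetical toy frame, so its counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the counts here are not those of the real dataset
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0, 78.0, 115.0],
    "BMI": [33.6, np.nan, 23.3, 28.1, np.nan, 25.6, 31.0, 35.3],
})

missing_counts = toy.isna().sum()           # missing values per column
missing_percent = toy.isna().mean() * 100   # share of rows affected, in percent

print(missing_counts)
print(missing_percent)
```

A count like this only says how much data is missing; whether the values can be treated as MCAR still depends on the missingness analysis from the earlier lessons.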
6. Deletion in diabetes DataFrame
Once we drop the rows using 'diabetes.dropna()', we can visualize the data again to confirm that the rows have been deleted.
7. Deletion in diabetes DataFrame
Similarly, although the missingness of 'BMI' appears to show a small correlation, this is not substantial evidence of a pattern, as only 11 values are missing.
Hence, let's delete the rows where 'BMI' is missing and confirm using the missingness matrix.
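The delete-and-confirm step can be sketched as below. The lesson confirms visually with the missingness matrix; this sketch uses a numeric check instead so that it needs only pandas. The toy frame and the name 'toy_clean' are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing BMI values
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0],
    "BMI": [33.6, np.nan, 23.3, np.nan, 43.1],
})

rows_before = len(toy)

# dropna returns a new DataFrame by default; 'toy' itself is unchanged
toy_clean = toy.dropna(subset=["BMI"], how="any")

rows_removed = rows_before - len(toy_clean)
bmi_missing_after = int(toy_clean["BMI"].isna().sum())

print(rows_removed, bmi_missing_after)
```

Since the result is a new DataFrame, it has to be assigned to a name (or back to the original one) for the deletion to take effect in later steps.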
8. Summary
In this lesson, you learned about pairwise and listwise deletion, as well as when to use them.
9. Let's practice!
It's now time for you to practice.