1. Handling missing data
In this chapter, we will focus on different types of messy data that we may encounter. This lesson will focus on missing values.
2. Missing data (an example)
Let's return to the restaurant inspection dataset. It is apparent that for some columns no value is provided.
Take this sample of records for example. The second record lacks a score. This is an example of missing data.
Missing data can have different representations.
It is common to use the special NULL value for representing a missing value for a record.
String columns can contain NULL values but may also use the empty string ('') to represent a missing value.
3. Causes of missing data
You might be wondering what causes column values to be missing.
Data can be missing due to human error. For example, a newly hired restaurant inspector may fail to register a score for certain types of inspections.
Systematic issues in how data is collected may cause values to be missing. As an example, a failed replication process could result in some inspection scores not being updated across databases.
4. Types of missing data
Missing data can be classified into different types: missing completely at random (or MCAR), missing at random (or MAR), and missing not at random (or MNAR).
5. Types of missing data
Missing data labeled as MCAR indicates that no systematic relationship exists between present and missing values. You would label missing data due to a human data entry error as an instance of MCAR.
6. Types of missing data
Missing data labeled as MAR indicates that a systematic relationship exists between missing data and other observed values. That is, values that are present can explain missing values.
7. Types of missing data
In the example given earlier, a systematic reason explains the missing score. "Administrative Miscellaneous / Re-inspection" inspection types do not receive scores. Therefore, the missing score in row 2 receives the MAR classification.
8. Types of missing data
The final type of missing data is MNAR. Missing data is labeled MNAR when a systematic relationship between missing data and unobserved values exists. Consider a particular inspector who fails to enter inspection scores after an inspection. In this case, the inspector's identity is related to the missing inspection score but the inspector's identity is not part of the record.
9. Identifying missing data
Typically, it is not practical to visually scan a table to identify missing data.
We can use SELECT statements to identify missing values in a column of interest.
If we want to know the number of records with missing data, the COUNT() function can be used to quickly compute this value.
10. Identifying missing data
We can investigate relationships between columns concerning missing data by counting NULL records in one column while grouping by the values in another.
The query shown here provides evidence that "Administrative Miscellaneous" inspections do not receive inspection scores.
11. Rectifying missing data
This leads us to the next challenge. When data is missing, what do we do about it?
When possible, we want to identify the missing value. Given a restaurant's street address and borough, we can identify the missing zip code of an inspection record.
Tracking down missing data is not always feasible.
Furthermore, fixing the missing data may not be worthwhile.
Using a fill-in value is often a useful solution. The average or median of the existing values in a numerical column can be a good candidate for the fill-in value. We will explore options for fill-in values in string columns in the exercises.
We can exclude records in cases where the data cannot be obtained, a fill-in value is not appropriate, and the value that is missing is crucial for some analysis. However, this approach should only be used as a last resort.
12. Replacing missing values with COALESCE()
PostgreSQL provides a convenient function for situations where displaying an actual value is preferable to displaying a NULL value.
The COALESCE() function accepts a variable number of arguments as inputs and outputs the value of the first non-NULL argument.
In the case that we want to display -1 (instead of a NULL score) for our restaurant inspection results, we can use COALESCE().
13. Replacing missing values with COALESCE()
This query replaces the NULL value in the score column of the 2nd row with a -1.
14. Let's practice!
Now that we have spent some time exploring how to handle missing data, let's put this knowledge to work with some practice exercises!