Get startedGet started for free

Filter missing values

1. Filter missing values

We've filtered known values. Now let's tackle the unknowns.

2. Missing values

Some of our query results include missing data, one of the biggest challenges in data analysis.

3. Missing values

In fact, we've even counted missing values before. Recall that COUNT(*) counts all records, while COUNT(language) will count only the records where there is a language value. It looks like we have a difference of 11 in our language counts.

4. Missing values

Gaps like this in our data can break calculations, skew averages, or hide patterns. Now, we'll learn techniques to identify, select, or exclude missing values, though fixing or replacing them is beyond the scope of this course.

5. Missing values: NULLs

Missing data can occur for many reasons: customers skip optional fields, systems fail, or values are not yet recorded. In SQL, true missing values are called NULLs, but they can also appear as empty strings or the word "None", for example. Let's use our AI assistant to identify NULLs and other unknown values in our data.

6. IS NULL

The prompt "Show film titles and languages with no language" generates a filter using IS NULL. We need IS NULL, not equals NULL, because NULL doesn't equal anything, not even itself.

7. IS NOT NULL

To find films with a language, we prompt "Show film titles and languages that have a language defined," and the AI uses IS NOT NULL.

8. Identifying other unknowns

Now, here's the catch. These filters only catch true NULL values. In reality, a field might contain other values representing missing data, like "Not available" or "None." These are not true NULL values, but they still represent missing or unknown values.

9. Identifying other unknowns

To catch these, we could return to text and pattern-matching to find exact matches, but we may miss variations.

10. Finding variations

A smarter approach is to explore the field first. The prompt "Show all unique language values" reveals every variation. In this case, only "None" or NULLs appeared.

11. Filtering variations

Once we spot these patterns, we can filter them with a prompt like: "Show film titles and languages where the language is not null or None". This generates a WHERE clause that excludes all missing value variations we've identified.

12. Working with zero

Here's another interesting case. What if the imdb_score field shows 0? Is it missing or a zero score? Our AI assistant cannot answer this for us. We need context from somewhere else, like documentation or business guidelines, to interpret it and decide if it should be filtered. If 0 means missing, then including it in calculations would give us the wrong result.

13. Handling missing values

Identifying missing data variations is an important analysis skill and can impact our conclusions. The key is spotting them correctly. Explore the field to see how missing values appear, then filter as needed. The aggregate functions we've learned ignore NULLs but still include zeros, while COUNT(*) includes every row, even those with NULLs.

14. Ready for practice?

Ready to uncover more data gaps?

Create Your Free Account

or

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.