1. Data cleaning and quality checks
Welcome back. Real pipeline data rarely arrives clean - it comes with missing values, duplicates, and invalid records. In this lesson, we'll build a cleaning pipeline to handle all of that systematically.
2. Three problems in real data
Real transaction datasets almost always arrive with three kinds of problems. Some rows have missing values in key columns - no Customer_ID means no owner for that transaction. Others are exact duplicates from a pipeline that ran more than once. And some records are logically wrong, like a negative transaction amount. We'll tackle all three in this lesson.
3. Why define a schema explicitly?
Before we start cleaning, the data needs to be loaded correctly. Without a schema, Spark samples the CSV to guess column types - and guessing can go wrong. Numeric amounts can be inferred as strings if an early row has a null. PySpark gives us a tool called StructType to define the types ourselves, so Spark applies them exactly, every run.
4. Imports
Let's import what we need. From pyspark.sql.types, we get StructType, StructField, and the individual type classes. We also import functions as F - the standard shorthand for column operations we'll use throughout this lesson.
5. Defining the schema
A StructType holds a list of StructField objects - one per column. Each takes three arguments: the column name, its data type, and whether nulls are allowed. Once the schema is defined, we pass it to .schema() on the reader and point .load() at our CSV file stored in Databricks Volumes. Spark applies our types directly instead of inferring them. printSchema() confirms everything landed correctly.
6. Finding the nulls
With the data loaded, let's find the missing values. This expression iterates over every column: F.col() references the column, .isNull() flags each null as true, and .cast('int') turns those into ones and zeros. F.sum() totals the ones, giving us a null count per column. Printing the results, we can see several columns have between thirty and fifty nulls each.
7. Dropping and filling nulls
We have two strategies. na.drop() with a subset drops any row where that column is null - no Customer_ID means no owner, so those rows go. na.fill() handles the rest with a dictionary of safe defaults: zero for the numeric amount, 'Unknown' for Category and Location. The counts confirm we dropped thirty rows and kept everything else.
8. Handling duplicates
With nulls handled, let's check for duplicates. Before removing anything, we compare the total row count against the distinct count - the difference tells us exactly how many duplicates exist. Here, that's one hundred and forty-nine.
9. Handling duplicates
dropDuplicates() with a subset parameter deduplicates our data by business key columns - here, same customer, same date, same amount means the same transaction. Spark keeps the first occurrence of each unique combination and drops the rest. Looking at the results, we're down from just over a hundred thousand rows to ninety-nine thousand nine hundred and seventy-one.
10. Filtering invalid records
Even with nulls and duplicates removed, some records are still logically wrong. We chain two filters: first, removing anything with a zero or negative amount, then keeping only completed transactions, which drops "Pending" and "Failed" rows. The final count gives us our clean working set of thirty-three thousand rows.
11. Creating derived columns
With clean data, we can enrich it. withColumn() adds a new column without changing anything else. F.when() works like an if statement - fifty thousand or more is High, ten thousand or more is Medium, and .otherwise() catches everything else as Low. Looking at the first three rows, all three bands show up correctly.
12. Data quality check
Before passing the DataFrame downstream, we run a final audit: total rows, duplicates, and the null rate for Customer_ID. Printing the results - zero duplicates and a zero percent null rate - confirms the pipeline worked as intended. And the best part: this check is repeatable. Run it against any new data load, and you'll catch problems before they reach your analysis.
13. Let's practice!
We now have a complete cleaning pipeline. Time for practice.