Examining invalid rows
You've successfully filtered out invalid rows using a join, but sometimes you'd like to examine the data you removed, whether to store it for later processing or to troubleshoot your data sources. In this exercise, you'll find the difference between two DataFrames and store the invalid rows.
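In PySpark, a common way to take the difference between two DataFrames on a key column is a left_anti join, which keeps only the rows of the left DataFrame that have no match on the right. Here is a minimal sketch with toy DataFrames using the course's spark session; the names and the id column are illustrative, not the exercise's actual data:

# Toy data: three rows total, two of them marked valid
all_rows = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'value'])
valid_rows = spark.createDataFrame([(1,), (3,)], ['id'])

# left_anti keeps the rows of all_rows whose 'id' has no match in
# valid_rows -- here, just the row with id == 2
invalid_rows = all_rows.join(valid_rows, 'id', 'left_anti')
invalid_rows.show()

If the two DataFrames share an identical schema, subtract() is an alternative, but an anti join lets you match on a key column alone.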
The spark object is defined and pyspark.sql.functions is imported as F. The original DataFrame split_df and the joined DataFrame joined_df are available as they were in their previous states.
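If you're following along outside the course environment, that setup corresponds roughly to the following; the application name here is illustrative:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('cleaning_data').getOrCreate()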
Exercise instructions
- Determine the row counts for each DataFrame.
- Create a DataFrame containing only the invalid rows.
- Validate that the count of the new DataFrame is as expected.
- Determine the number of distinct folder rows removed.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the row counts for each DataFrame
split_count = ____
joined_count = ____
# Create a DataFrame containing the invalid rows
invalid_df = split_df.____(____(joined_df), '____', '____')
# Validate the count of the new DataFrame is as expected
invalid_count = ____
print(" split_df:\t%d\n joined_df:\t%d\n invalid_df: \t%d" % (split_count, joined_count, invalid_count))
# Determine the number of distinct folder rows removed
invalid_folder_count = invalid_df.____('____').____.____
print("%d distinct invalid folders found" % invalid_folder_count)