Examining invalid rows
You've successfully filtered out invalid rows using a join, but sometimes you'd like to examine the data you removed, whether to store it for later processing or to troubleshoot your data sources. In this exercise, you'll find the difference between two DataFrames and store the invalid rows.
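In PySpark, a common way to take the difference between two DataFrames on a key column is a left_anti join, which keeps only the rows of the left DataFrame that have no match on the right. Here is a minimal sketch with toy DataFrames using the course's spark session; the names and the id column are illustrative, not the exercise's actual data:

# Toy data: three rows total, two of them marked valid
all_rows = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'value'])
valid_rows = spark.createDataFrame([(1,), (3,)], ['id'])

# left_anti keeps the rows of all_rows whose 'id' has no match in
# valid_rows -- here, just the row with id == 2
invalid_rows = all_rows.join(valid_rows, 'id', 'left_anti')
invalid_rows.show()

If the two DataFrames share an identical schema, subtract() is an alternative, but an anti join lets you match on a key column alone.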
The spark object is defined and pyspark.sql.functions is imported as F. The original DataFrame split_df and the joined DataFrame joined_df are available as they were in their previous states.
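If you're following along outside the course environment, that setup corresponds roughly to the following; the application name here is illustrative:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('cleaning_data').getOrCreate()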
Exercise instructions
- Determine the row counts for each DataFrame.
- Create a DataFrame containing only the invalid rows.
- Validate that the count of the new DataFrame is as expected.
- Determine the number of distinct folder rows removed.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Determine the row counts for each DataFrame
split_count = ____
joined_count = ____
# Create a DataFrame containing the invalid rows
invalid_df = split_df.____(____(joined_df), '____', '____')
# Validate the count of the new DataFrame is as expected
invalid_count = ____
print(" split_df:\t%d\n joined_df:\t%d\n invalid_df: \t%d" % (split_count, joined_count, invalid_count))
# Determine the number of distinct folder rows removed
invalid_folder_count = invalid_df.____('____').____.____
print("%d distinct invalid folders found" % invalid_folder_count)