Exercise

Examining invalid rows

You've successfully filtered out invalid rows using a join, but sometimes you'd like to examine the data that was removed. Storing this data lets you process it later or troubleshoot your data sources.

You want to find the difference between two DataFrames and store the invalid rows.

The spark object is defined and pyspark.sql.functions is imported as F. The original DataFrame split_df and the joined DataFrame joined_df are available in their previous states.

Instructions
100 XP
  • Determine the row counts for each DataFrame.
  • Create a DataFrame containing only the invalid rows.
  • Validate the count of the new DataFrame is as expected.
  • Determine the number of distinct folder rows removed.