1. Learn
  2. /
  3. Courses
  4. /
  5. Cleaning Data with PySpark

Exercise

Validate rows via join

Another example of filtering data is using joins to remove invalid entries. You'll need to verify the folder names are as expected based on a given DataFrame named valid_folders_df. The DataFrame split_df is as you last left it with a group of split columns.

The spark object is available, and pyspark.sql.functions is imported as F.

Instructions

100 XP
  • Rename the _c0 column to folder on the valid_folders_df DataFrame.
  • Count the number of rows in split_df.
  • Join the two DataFrames on the folder name, and call the resulting DataFrame joined_df. Make sure to broadcast the smaller DataFrame.
  • Check the number of rows remaining in the DataFrame and compare.