Validate rows via join
Another example of filtering data is using joins to remove invalid entries. You'll need to verify that the folder names are as expected, based on a given DataFrame named valid_folders_df. The DataFrame split_df is as you last left it, with a group of split columns.
The spark object is available, and pyspark.sql.functions is imported as F.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Rename the _c0 column to folder on the valid_folders_df DataFrame.
- Count the number of rows in split_df.
- Join the two DataFrames on the folder name, and call the resulting DataFrame joined_df. Make sure to broadcast the smaller DataFrame.
- Check the number of rows remaining in the DataFrame and compare.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Rename the column in valid_folders_df
valid_folders_df = ____
# Count the number of rows in split_df
split_count = ____
# Join the DataFrames
joined_df = split_df.____(____(valid_folders_df), "folder")
# Compare the number of rows remaining
joined_count = ____
print("Before: %d\nAfter: %d" % (split_count, joined_count))