
Validate rows via join

Another example of filtering data is using joins to remove invalid entries. You'll need to verify that the folder names are as expected, based on a given DataFrame named valid_folders_df. The DataFrame split_df is as you last left it, with a group of split columns.

The spark object is available, and pyspark.sql.functions is imported as F.

This exercise is part of the course

Cleaning Data with PySpark

Exercise instructions

  • Rename the _c0 column to folder on the valid_folders_df DataFrame.
  • Count the number of rows in split_df.
  • Join the two DataFrames on the folder name, and call the resulting DataFrame joined_df. Make sure to broadcast the smaller DataFrame.
  • Check the number of rows remaining in the DataFrame and compare.

Interactive hands-on exercise

Try to solve this exercise by completing the sample code.

# Rename the column in valid_folders_df
valid_folders_df = ____

# Count the number of rows in split_df
split_count = ____

# Join the DataFrames
joined_df = split_df.____(____(valid_folders_df), "folder")

# Compare the number of rows remaining
joined_count = ____
print("Before: %d\nAfter: %d" % (split_count, joined_count))