Validate rows via join

Joins offer another way to filter data: an inner join against a list of known-good values drops every row that has no match. In this exercise, you'll verify that the folder names in split_df are as expected by joining against a given DataFrame named valid_folders_df. The DataFrame split_df is as you last left it, with a group of split columns.

The spark object is available, and pyspark.sql.functions is imported as F.
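
To see why a join can act as a filter, here is a minimal plain-Python sketch of inner-join semantics on a single key. The rows and folder names below are hypothetical and are not taken from the course data.

```python
# Hypothetical rows shaped like (folder, filename); not the course's real data.
rows = [
    ("folder_a", "img_001.jpg"),
    ("folder_a", "img_002.jpg"),
    ("folder_x", "img_003.jpg"),  # folder_x is not in the valid list
]

# Hypothetical list of valid folder names (each appears once).
valid_folders = {"folder_a", "folder_b"}

# An inner join on "folder" keeps only rows whose key exists on both sides,
# which is equivalent to this membership filter when the valid keys are unique.
kept = [row for row in rows if row[0] in valid_folders]

print("Before: %d\nAfter: %d" % (len(rows), len(kept)))
```

The equivalence only holds when each valid folder appears once; duplicate keys on the lookup side would multiply matching rows in a real join.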

This exercise is part of the course

Cleaning Data with PySpark

Exercise instructions

  • Rename the _c0 column to folder on the valid_folders_df DataFrame.
  • Count the number of rows in split_df.
  • Join the two DataFrames on the folder name, and call the resulting DataFrame joined_df. Make sure to broadcast the smaller DataFrame.
  • Count the rows remaining in joined_df and compare that number with the original count.

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Rename the column in valid_folders_df
valid_folders_df = ____

# Count the number of rows in split_df
split_count = ____

# Join the DataFrames
joined_df = split_df.____(____(valid_folders_df), "folder")

# Compare the number of rows remaining
joined_count = ____
print("Before: %d\nAfter: %d" % (split_count, joined_count))