
Validate rows via join

Another example of filtering data is using joins to remove invalid entries. You'll need to verify that the folder names are as expected, based on a given DataFrame named valid_folders_df. The DataFrame split_df is as you last left it, with a group of split columns.

The spark object is available, and pyspark.sql.functions is imported as F.

This exercise is part of the course

Cleaning Data with PySpark


Exercise instructions

  • Rename the _c0 column to folder on the valid_folders_df DataFrame.
  • Count the number of rows in split_df.
  • Join the two DataFrames on the folder name, and call the resulting DataFrame joined_df. Make sure to broadcast the smaller DataFrame.
  • Check the number of rows remaining in the DataFrame and compare.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Rename the column in valid_folders_df
valid_folders_df = ____

# Count the number of rows in split_df
split_count = ____

# Join the DataFrames
joined_df = split_df.____(____(valid_folders_df), "folder")

# Compare the number of rows remaining
joined_count = ____
print("Before: %d\nAfter: %d" % (split_count, joined_count))
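
If you get stuck, here is one possible way to fill in the blanks. It is a sketch that assumes split_df already contains a folder column and that valid_folders_df was read without a header, so its only column is named _c0:

# Rename the column in valid_folders_df
valid_folders_df = valid_folders_df.withColumnRenamed('_c0', 'folder')

# Count the number of rows in split_df
split_count = split_df.count()

# Broadcast the smaller DataFrame and join on the folder name
joined_df = split_df.join(F.broadcast(valid_folders_df), "folder")

# Compare the number of rows remaining after the join
joined_count = joined_df.count()
print("Before: %d\nAfter: %d" % (split_count, joined_count))

F.broadcast() hints to Spark that valid_folders_df is small enough to be copied to every worker, so the join can run without shuffling the larger split_df. Because the default join type is inner, only the rows of split_df whose folder value appears in valid_folders_df are kept, so joined_count should be less than or equal to split_count.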