MulaiMulai sekarang secara gratis

Filtering column content with Python

You've looked at using various operations on DataFrame columns - now you can modify a real dataset. The DataFrame voter_df contains information regarding the voters on the Dallas City Council from the past few years. This truncated DataFrame contains the date of the vote being cast and the name and position of the voter. Your manager has asked you to clean this data so it can later be integrated into some desired reports. The primary task is to remove any null entries or odd characters and return a specific set of voters where you can validate their information.

This is often one of the first steps in data cleaning - removing anything that is obviously outside the format. For this dataset, make sure to look at the original data and see what looks out of place for the VOTER_NAME column.

The pyspark.sql.functions library is already imported under the alias F.

Latihan ini adalah bagian dari kursus

Cleaning Data with PySpark

Lihat Kursus

Petunjuk latihan

  • Show the distinct VOTER_NAME entries.
  • Filter voter_df where the VOTER_NAME is 1-20 characters in length.
  • Filter out voter_df where the VOTER_NAME contains an _.
  • Show the distinct VOTER_NAME entries again.

Latihan interaktif praktis

Cobalah latihan ini dengan menyelesaikan kode contoh berikut.

# Show the distinct VOTER_NAME entries
voter_df.select(____).distinct().show(40, truncate=False)

# Filter voter_df where the VOTER_NAME is 1-20 characters in length
voter_df = ____('length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20')

# Filter out voter_df where the VOTER_NAME contains an underscore
voter_df = voter_df.filter(~ F.col('VOTER_NAME').____)

# Show the distinct VOTER_NAME entries again
voter_df.____(____).____().____(40, truncate=False)
Edit dan Jalankan Kode