Filtering column content with Python
You've looked at using various operations on DataFrame columns; now you can modify a real dataset. The DataFrame voter_df contains information about the voters on the Dallas City Council from the past few years. This truncated DataFrame contains the date each vote was cast along with the name and position of the voter. Your manager has asked you to clean this data so it can later be integrated into some desired reports. The primary task is to remove any null entries or odd characters and return a specific set of voters whose information you can validate.
This is often one of the first steps in data cleaning: removing anything that is obviously outside the expected format. For this dataset, make sure to look at the original data and see what looks out of place in the VOTER_NAME column.
The pyspark.sql.functions library is already imported under the alias F.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Show the distinct VOTER_NAME entries.
- Filter voter_df where the VOTER_NAME is 1-20 characters in length.
- Filter out voter_df where the VOTER_NAME contains an _.
- Show the distinct VOTER_NAME entries again.
Hands-on interactive exercise
Have a go at this exercise with this sample code. One way to complete the blanks:
# Show the distinct VOTER_NAME entries
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)
# Filter voter_df where the VOTER_NAME is 1-20 characters in length
voter_df = voter_df.filter('length(VOTER_NAME) > 0 and length(VOTER_NAME) < 20')
# Filter out voter_df where the VOTER_NAME contains an underscore
voter_df = voter_df.filter(~ F.col('VOTER_NAME').contains('_'))
# Show the distinct VOTER_NAME entries again
voter_df.select('VOTER_NAME').distinct().show(40, truncate=False)
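The two filter conditions used here (a name length strictly between 0 and 20, and no underscore in the name) can be checked without a Spark cluster. As a minimal plain-Python sketch of the same predicates, with invented sample names standing in for the Dallas dataset:

```python
# Spark-free sketch of the exercise's cleaning logic.
# The sample names below are invented for illustration only.
raw_names = [
    "Jennifer S. Gates",                                   # a valid-looking name
    "this entry is clearly longer than twenty characters",  # too long: filtered out
    "011018__VOTERNAME",                                   # contains underscores: filtered out
    "",                                                    # empty (null-like): filtered out
]

def is_valid_voter_name(name: str) -> bool:
    """Mirror the exercise's filters: 0 < length < 20 and no underscore."""
    return 0 < len(name) < 20 and "_" not in name

cleaned = [n for n in raw_names if is_valid_voter_name(n)]
print(cleaned)  # -> ['Jennifer S. Gates']
```

Only the first entry survives both predicates, which is the same effect the two chained voter_df.filter() calls have on the Spark DataFrame.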