Modifying DataFrame columns
Previously, you filtered out any rows that didn't conform to something generally resembling a name. Now based on your earlier work, your manager has asked you to create two new columns - first_name
and last_name
. She asks you to split the VOTER_NAME
column into words on any space character. You'll treat the last word as the last_name
, and all other words as the first_name
. You'll be using some new functions in this exercise including .split()
, .size()
, and .getItem()
. The .getItem(index)
takes an integer value to return the appropriately numbered item in the column. The functions .split()
and .size()
are in the pyspark.sql.functions
library.
Please note that these operations are always somewhat specific to the use case. Having your data conform to a format often matters more than the specific details of the format. Rarely is a data cleaning task meant just for one person - matching a defined format allows for easier sharing of the data later (ie, Paul doesn't need to worry about names - Mary already cleaned the dataset).
The filtered voter DataFrame from your previous exercise is available as voter_df
. The pyspark.sql.functions
library is available under the alias F
.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Add a new column called
splits
holding the list of possible names. - Use the
getItem()
method and create a new column calledfirst_name
. - Get the last entry of the
splits
list and create a column calledlast_name
. - Drop the
splits
column and show the newvoter_df
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Add a new column called splits separated on whitespace
voter_df = voter_df.withColumn(____, F.____(voter_df.VOTER_NAME, '\s+'))
# Create a new column called first_name based on the first item in splits
voter_df = voter_df.withColumn(____, voter_df.splits.getItem(____)
# Get the last entry of the splits list and create a column called last_name
voter_df = voter_df.withColumn(____, voter_df.splits.getItem(F.____('splits') - ____))
# Drop the splits column
voter_df = voter_df.____('splits')
# Show the voter_df DataFrame
voter_df.show()