Using user-defined functions in Spark
You've seen some of the power behind Spark's built-in string functions when it comes to manipulating DataFrames. However, once you reach a certain point, it becomes difficult to process the data without creating a rat's nest of function calls. This is one place where you can use user-defined functions (UDFs) to manipulate your DataFrames.
For this exercise, we'll use the voter_df DataFrame and replace the first_name column with the first and middle names.
The pyspark.sql.functions library is available under the alias F. The classes from pyspark.sql.types are already imported.
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Edit the getFirstAndMiddle() function to return a space-separated string of names, excluding the last entry in the names list.
- Define the function as a user-defined function. It should return a string type.
- Create a new column on voter_df called first_and_middle_name using your UDF.
- Show the DataFrame.
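Before wiring anything into Spark, it helps to see the core list-slicing logic the first step asks for on its own. This is a plain-Python sketch (the snake_case name is my own, not from the exercise): drop the last entry of the list and join the rest with spaces.

```python
def get_first_and_middle(names):
    # names[:-1] keeps everything except the last entry (the surname);
    # ' '.join(...) turns the remaining list into one space-separated string
    return ' '.join(names[:-1])

print(get_first_and_middle(['John', 'Q', 'Public']))  # John Q
print(repr(get_first_and_middle(['Solo'])))           # '' (no first/middle names)
```

Note that a single-element list yields an empty string, which is usually the behavior you want for mononymous entries.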
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def getFirstAndMiddle(names):
# Return a space separated string of names
return ' '.join(____)
# Define the method as a UDF
udfFirstAndMiddle = F.____(____, ____)
# Create a new column using your UDF
voter_df = voter_df.withColumn('first_and_middle_name', ____(____))
# Show the DataFrame
____