Exercise

Using user defined functions in Spark

You've seen some of the power behind Spark's built-in string functions when it comes to manipulating DataFrames. Past a certain point, however, it becomes difficult to process the data without creating a rat's nest of function calls. This is one place where you can use User Defined Functions (UDFs) to manipulate your DataFrames.

For this exercise, we'll use our voter_df DataFrame, but you're going to replace the first_name column with the first and middle names.

The pyspark.sql.functions library is available under the alias F. The classes from pyspark.sql.types are already imported.

Instructions

  • Edit the getFirstAndMiddle() function to return a space-separated string of every entry in the names list except the last.
  • Define the function as a user-defined function. It should return a string type.
  • Create a new column on voter_df called first_and_middle_name using your UDF.
  • Show the DataFrame.
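
One way the steps above might fit together is sketched below. This is not the official solution. It assumes voter_df already holds an array-of-strings column with the split name parts; the column name splits and the helper add_first_and_middle() are hypothetical and do not come from the exercise text. The pyspark imports sit inside the helper so that the plain Python function can be tried on its own.

```python
def getFirstAndMiddle(names):
    """Join every entry of `names` except the last with single spaces."""
    return " ".join(names[:-1])


def add_first_and_middle(voter_df, names_col="splits"):
    """Hypothetical helper: add first_and_middle_name to voter_df.

    Assumes `names_col` (default "splits", an assumed name) is an
    array-of-strings column holding the split name parts.
    """
    # Imported here so getFirstAndMiddle can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the Python function as a UDF that returns a string.
    udfFirstAndMiddle = F.udf(getFirstAndMiddle, StringType())

    # Apply the UDF to the name-parts column to build the new column.
    return voter_df.withColumn(
        "first_and_middle_name", udfFirstAndMiddle(voter_df[names_col])
    )
```

With a live Spark session, calling add_first_and_middle(voter_df).show() would display the result. Note that a one-element list yields an empty string, since there is nothing before the last entry.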