U define it? U use it!
1. U define it? U use it!
Now, we’ll take a look at User-Defined Functions, or UDFs, in PySpark. By the end of this video, we’ll have a thorough understanding of what UDFs are and how to create them.
2. UDFs for repeatable tasks
A UDF is a custom function we create to work with data in PySpark DataFrames. We'll discuss two main kinds of UDFs: PySpark UDFs and pandas UDFs. The main difference is the size of the dataset each is designed to handle. There are some differences in execution, but they operate very similarly! They are reusable and repeatable because they are registered directly with the `SparkSession`. Of the two types, pandas UDFs are for larger datasets and PySpark UDFs are for smaller datasets. This may seem counterintuitive, but it comes down to how each type processes data: PySpark UDFs handle one row at a time, while pandas UDFs work on batches of rows. The details are outside our scope here, but they're well worth exploring as we gain more PySpark experience.
3. Defining and registering a UDF
Now, how do we create a UDF? Let's see how it's done. First, we create a regular Python function called `to_upper_case()` to convert all strings in a column to upper case. Next, we register the function as a UDF using PySpark's `udf()` function and pass it the correct return data type. We're doing string operations, so we use `StringType()`. Without registering, the UDF would not be available to all the worker nodes of the Spark session. Lastly, we apply it to the DataFrame `df` and show the results.
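Here's a minimal sketch of those steps, assuming a running `SparkSession` and a small hypothetical DataFrame with a string column called `name`:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a DataFrame with one string column, "name"
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# A regular Python function that upper-cases a string
def to_upper_case(text):
    return text.upper() if text is not None else None

# Register the function as a PySpark UDF, declaring the return type
to_upper_case_udf = udf(to_upper_case, StringType())

# Apply the UDF to the "name" column and show the results
df.withColumn("name_upper", to_upper_case_udf("name")).show()
```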
4. pandas UDF
While PySpark UDFs are incredibly useful, they can introduce performance overhead, because PySpark has to convert data back and forth between the JVM and Python one row at a time, which can cause frustrating performance problems. To mitigate this, PySpark introduced pandas UDFs. Here's an example of a pandas UDF, which is defined with `pandas_udf()`. There are two things to notice here. First, we have to import the `pandas_udf()` function. Second, we define the data type with the decorator `@pandas_udf("float")`; a decorator is a way of wrapping a function by taking it as an argument. We also don't need to register it with the Spark session.
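As a rough sketch of that pattern, assuming a hypothetical DataFrame with a numeric `price` column and a made-up dollars-to-cents conversion:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a DataFrame with one numeric column, "price"
df = spark.createDataFrame([(10.0,), (25.5,)], ["price"])

# A pandas UDF receives a whole batch of rows as a pandas Series,
# rather than one row at a time like a regular PySpark UDF
@pandas_udf("float")
def dollars_to_cents(prices: pd.Series) -> pd.Series:
    return prices * 100.0

# Applied like any other column function; no explicit registration needed
df.withColumn("price_cents", dollars_to_cents("price")).show()
```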
5. PySpark UDFs vs. pandas UDFs
So when should we use PySpark UDFs, and when should we opt for pandas UDFs? If we are working with small datasets or simple transformations, PySpark UDFs will suffice. We use them at the column level and register them directly with the `SparkSession`; registering with the SparkSession makes the UDF available to all nodes of the Spark cluster. For large datasets, however, pandas UDFs are preferred due to their superior performance at scale, and they don't need to be registered with the `SparkSession`. There are still scenarios where we would consider a PySpark UDF over a pandas UDF, but they involve a cost-benefit analysis of compute cost, development environment, data size, and data type that is outside the scope of this course.
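For completeness, one way that session-level registration can look, reusing the hypothetical `to_upper_case` function and `df` from the earlier sketch, is to register the UDF by name so it can also be called from Spark SQL:

```python
from pyspark.sql.types import StringType

# Register the Python function by name on the SparkSession,
# which makes it callable from Spark SQL across the cluster
spark.udf.register("to_upper_case", to_upper_case, StringType())

# Use the registered UDF in a SQL query against a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name, to_upper_case(name) AS name_upper FROM people").show()
```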
6. Let's practice!
Let's go practice UDFs!