Practicing creating a UDF
Sometimes your data needs a transformation that the built-in functions do not support. This is where a custom user-defined function ("UDF") is suitable.
The SQL function udf is available.
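As a rough sketch of the general pattern (not this exercise's solution), a UDF is an ordinary Python function paired with a declared return type. The names last_token and make_last_token_udf below are hypothetical, and the wrapper assumes pyspark is installed; the plain function can be tried without Spark.

```python
from typing import List, Optional


def last_token(tokens: Optional[List[str]]) -> str:
    """Return the last item of a token list as a string, or '' if there is none."""
    # A SQL NULL reaches a Python UDF as None, so guard against it
    # (and against empty lists) before indexing.
    if not tokens:
        return ""
    return str(tokens[-1])


def make_last_token_udf():
    """Wrap last_token as a Spark UDF (hypothetical helper; needs pyspark)."""
    # Imported lazily so the plain function above stays usable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # The second argument declares the SQL type of the resulting column.
    return udf(last_token, StringType())
```

Assuming a dataframe such as df2 with an array<string> column named doc, the wrapped function could then be applied with something like df2.select(make_last_token_udf()('doc')). Keeping the logic in a plain function makes it easy to check on ordinary Python lists before running it on a cluster.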
A dataframe df2 is available, of type DataFrame[doc: array<string>, in: array<string>, out: array<string>]. Its doc column contains trivial tokens.
The following displays the first 20 rows of df2 where doc contains '1':
df2.where(array_contains('doc','1')).show()
You have two objectives to fulfill:
- Ensure that the transformed data consists of nonempty vectors.
- A dataframe has a column that contains arrays of strings, where each array has a single item. You'd like to transform this column into a string column.
This exercise is part of the course
Introduction to Spark SQL in Python
Exercise instructions
- Create a udf that returns true if and only if the value is a nonempty vector, using numNonzeros().
- Create a udf that returns the first element of the array as a string.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Returns true if the value is a nonempty vector
nonempty_udf = udf(lambda x:
True if (x and hasattr(x, "toArray") and x.____())
else False, ____())
# Returns first element of the array as string
s_udf = udf(lambda x: ____(x[0]) if (x and type(x) is list and len(x) > 0)
else '', ____())