Practicing array column
The SQL function udf is available, as well as a dataframe df_before of type DataFrame[doc: array<string>, in: array<string>, out: array<string>].
The TRIVIAL_TOKENS variable is a set. It contains certain words that we want to remove.
This exercise is part of the course Introduction to Spark SQL in Python.
Exercise instructions
- Show the rows of df_before where doc contains the item 5.
- Create a udf that removes items in TRIVIAL_TOKENS from an array column. The order does not need to be preserved.
- Remove tokens from the in and out columns in df2 that appear in TRIVIAL_TOKENS.
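The token-removal step can be tried in plain Python before wrapping it in a udf. The sketch below is only a model of the function body, run outside Spark; the contents of TRIVIAL_TOKENS, the helper name rm_trivial, and the sample list are hypothetical stand-ins, not values from the exercise.

```python
# Plain-Python model of the token-removal logic; no Spark session is needed.
# TRIVIAL_TOKENS here is a hypothetical stand-in for the exercise's set.
TRIVIAL_TOKENS = {"the", "a", "of"}

def rm_trivial(tokens):
    """Drop trivial tokens; pass None and empty lists through unchanged."""
    # set(None) would raise a TypeError, so falsy inputs are returned as-is.
    return list(set(tokens) - TRIVIAL_TOKENS) if tokens else tokens

sample = ["the", "cat", "of", "cat", "doom"]  # hypothetical array value
cleaned = rm_trivial(sample)

# Sort before printing because set order is not guaranteed.
print(sorted(cleaned))   # ['cat', 'doom']
print(rm_trivial(None))  # None
print(rm_trivial([]))    # []
```

Two side effects of going through a set are worth noting: the original token order is lost, which the instructions allow, and duplicate tokens collapse to a single occurrence.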
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Show the rows where doc contains the item '5'
df_before.where(array_contains('doc', '____')).show()
# UDF removes items in TRIVIAL_TOKENS from array
rm_trivial_udf = udf(lambda x:
                     list(set(x) - ____) if x
                     else x,
                     ArrayType(____()))
# Remove trivial tokens from 'in' and 'out' columns of df2
df_after = df_before.withColumn('in', ____('in'))\
                    .withColumn('out', ____('out'))
# Show the rows of df_after where doc contains the item '5'
df_after.where(array_contains('doc','5')).show()