Further parsing
You've reshaped this dataset into a significantly different format, but there are still a few things left to do. You need to prep the column data for use in later analysis and remove a few intermediary columns.
The spark context is available, and pyspark.sql.functions is aliased as F. The types from pyspark.sql.types are already imported. The split_df DataFrame is as you last left it. Remember, you can use .printSchema() on a DataFrame in the console area to view the column names and types.
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Create a new function called retriever that takes two arguments, the split columns (cols) and the total number of columns (colcount). This function should return a list of the entries that have not been defined as columns yet (i.e., everything after item 4 in the list).
- Define the function as a Spark UDF, returning an Array of strings.
- Create the new column dog_list using the UDF and the available columns in the DataFrame.
- Remove the columns _c0, colcount, and split_cols.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def retriever(____, ____):
# Return a list of dog data
return ____[4:____]
# Define the method as a UDF
udfRetriever = ____(____, ArrayType(____))
# Create a new column using your UDF
split_df = split_df.withColumn('dog_list', ____(____, ____))
# Remove the original column, split_cols, and the colcount
split_df = split_df.drop('____').____('____').____('____')