
Further parsing

You've molded this dataset into a significantly different format from the one you started with, but there are still a few things left to do: prep the column data for use in later analysis and remove a few intermediary columns.

The Spark context is available, and pyspark.sql.functions is aliased as F. The types from pyspark.sql.types are already imported. The split_df DataFrame is as you last left it. Remember, you can use .printSchema() on a DataFrame in the console area to view the column names and types.
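For context, here is a minimal sketch of how a DataFrame like split_df could have been built. The input path and the tab delimiter are assumptions for illustration, not the course's actual file:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: each row arrives as one raw string in column _c0
raw_df = spark.read.csv('dog_annotations.txt')  # path is an assumption

# Split the raw string into a list of entries and count them
split_df = raw_df.withColumn('split_cols', F.split(raw_df['_c0'], '\t'))
split_df = split_df.withColumn('colcount', F.size('split_cols'))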

⚠️ Note: If you see an AttributeError, refresh the exercise and click Run Solution without clicking Run Code first.

This exercise is part of the course Cleaning Data with PySpark.

Exercise instructions

  • Create a new function called retriever that takes two arguments: the split columns (cols) and the total number of columns (colcount). This function should return a list of the entries that have not yet been defined as named columns (i.e., everything from index 4 onward).
  • Define the function as a Spark UDF, returning an Array of strings.
  • Create the new column dog_list using the UDF and the available columns in the DataFrame.
  • Remove the columns _c0, colcount, and split_cols.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def retriever(____, ____):
  # Return a list of dog data
  return ____[4:____]

# Define the method as a UDF
udfRetriever = ____(____, ArrayType(____))

# Create a new column using your UDF
split_df = split_df.withColumn('dog_list', ____(____, ____))

# Remove the original column, split_cols, and colcount
split_df = split_df.drop('____').____('____').____('____')
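
For reference, here is one possible completion, a sketch consistent with the instructions above (it assumes split_cols and colcount exist as described and that the first four list entries are already captured in named columns):

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

def retriever(cols, colcount):
  # Return the dog entries past the first four defined columns
  return cols[4:colcount]

# Define the method as a UDF returning an array of strings
udfRetriever = F.udf(retriever, ArrayType(StringType()))

# Create a new column by applying the UDF to the split data
split_df = split_df.withColumn('dog_list',
                               udfRetriever(split_df.split_cols, split_df.colcount))

# Remove the original column, split_cols, and colcount
split_df = split_df.drop('_c0').drop('split_cols').drop('colcount')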