Per image count

Your next task in building a data pipeline for this dataset is to create a few analysis-oriented columns. You've been asked to calculate the number of dogs found in each image, based on the dog_list column you created earlier. You have also created the DogType structure, which allows better parsing of the data within some of the columns.

The joined_df is available as you last defined it, and the DogType StructType is defined. pyspark.sql.functions is available under the F alias.

Create a Python function to split each entry in dog_list into its appropriate parts. Make sure to convert any strings into the appropriate types, or the DogType will not parse correctly.
Create a UDF using the above function.
Use the UDF to create a new column called dogs. Drop the previous column in the same command.
Show the number of dogs in the new column for the first 10 rows.
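
One possible way to approach these steps is sketched below. It is not the reference solution. It assumes each dog_list entry is a comma-separated string of the form "breed,start_x,start_y,end_x,end_y" and that DogType has one string field followed by four integer fields; neither is stated in the exercise, so adjust the unpacking to match your data. The names dog_parse and add_dogs_column are hypothetical.

```python
def dog_parse(dog_list):
    """Split each dog_list entry into a tuple of typed values.

    Assumption: every entry looks like "breed,start_x,start_y,end_x,end_y".
    """
    dogs = []
    for dog in dog_list:
        breed, start_x, start_y, end_x, end_y = dog.split(",")
        # Convert the coordinates to int so they match the integer
        # fields assumed for DogType; breed stays a string.
        dogs.append((breed, int(start_x), int(start_y), int(end_x), int(end_y)))
    return dogs


def add_dogs_column(joined_df, dog_type):
    """Return joined_df with a parsed dogs column replacing dog_list."""
    # Imported here so dog_parse can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType

    # The UDF returns an array of structs described by dog_type.
    udf_dog_parse = F.udf(dog_parse, ArrayType(dog_type))

    # Add the new column and drop the old one in the same command.
    return joined_df.withColumn("dogs", udf_dog_parse("dog_list")).drop("dog_list")
```

In the exercise environment, joined_df = add_dogs_column(joined_df, DogType) followed by joined_df.select(F.size("dogs")).show(10) would then display the number of dogs per image for the first 10 rows.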
