
Per image count

Your next task in building a data pipeline for this dataset is to create a few analysis-oriented columns. You've been asked to calculate the number of dogs found in each image, based on the dog_list column you created earlier. You have also created the DogType schema, which allows some of the data columns to be parsed more cleanly.

The joined_df is available as you last defined it, and the DogType StructType is defined. pyspark.sql.functions is available under the F alias.

This exercise is part of the course Cleaning Data with PySpark.

Exercise instructions

  • Create a Python function to split each entry in dog_list into its appropriate parts. Make sure to convert any strings into the appropriate types, or the DogType will not parse correctly.
  • Create a UDF using the above function.
  • Use the UDF to create a new column called dogs.
  • Show the number of dogs in the new column for the first 10 rows.
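
The parsing step in the first instruction can be tried in plain Python before wrapping it in a UDF. The sketch below is only an illustration: it assumes each dog_list entry is a comma-separated string of the form breed,start_x,start_y,end_x,end_y, and both the delimiter and the sample values are assumptions, not taken from the exercise data. The name dog_parse is likewise hypothetical.

```python
from typing import List, Tuple

# Assumed layout of one parsed dog: breed plus four integer coordinates.
DogTuple = Tuple[str, int, int, int, int]


def dog_parse(doglist: List[str]) -> List[DogTuple]:
    """Split each entry into its parts and convert the coordinates to int."""
    dogs = []
    for dog in doglist:
        # Assumption: fields are separated by commas, in this fixed order.
        breed, start_x, start_y, end_x, end_y = dog.split(",")
        dogs.append((breed, int(start_x), int(start_y), int(end_x), int(end_y)))
    return dogs


# Made-up sample entries, for illustration only.
sample = ["affenpinscher,0,9,173,298", "collie,10,20,110,220"]
parsed = dog_parse(sample)

# The per-image dog count is simply the length of the parsed list.
print(parsed)
print(len(parsed))
```

The integer conversion matters because a struct schema with integer fields will not accept string values for them. In the exercise itself, a function like this would be registered as a UDF with an ArrayType return type and the count taken on the resulting column, rather than with len.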

Hands-on interactive exercise

Complete this sample code to finish the exercise.

# Create a function to return each dog's breed and coordinates as a list of tuples
def dogParse(doglist):
  dogs = []
  for dog in doglist:
    (breed, start_x, start_y, end_x, end_y) = dog.____('____')
    dogs.append((____, int(____), ____, ____, ____))
  return dogs

# Create a UDF
udfDogParse = ____(____, ArrayType(____))

# Use the UDF to create the list of dogs
joined_df = joined_df.____('____', ____('____'))

# Show the number of dogs in the first 10 rows
joined_df.____(____('____')).____(____)