Exercise

Grouped summary statistics

In this exercise, we are going to combine the .groupBy() and .filter() methods that you've used previously to calculate two pairs of statistics: the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated.

Because our data now includes 0s for items a user has not yet consumed, we need to .filter() those rows out before computing grouped summary statistics like these. Otherwise, every unconsumed user-song pair would be counted as a rating. The msd dataset is provided for you here. The col(), min(), and avg() functions from pyspark.sql.functions have been imported for you.
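To see why the filter matters, here is a minimal plain-Python sketch that does not use Spark. The rows and the (userId, songId, num_plays) layout are hypothetical stand-ins, not the actual msd data. It counts ratings per song with and without the zero rows.

```python
from collections import Counter

# Hypothetical (userId, songId, num_plays) rows; 0 means "not yet consumed".
rows = [
    (1, 10, 3), (1, 11, 0), (1, 12, 0),
    (2, 10, 1), (2, 11, 2), (2, 12, 0),
]

# Counting every row per song treats unconsumed pairs as if they were ratings.
counts_all = Counter(song for _, song, _ in rows)

# Counting only rows with num_plays > 0 reflects real interactions.
counts_played = Counter(song for _, song, plays in rows if plays > 0)

min_all = min(counts_all.values())
min_played = min(counts_played.values())
avg_played = sum(counts_played.values()) / len(counts_played)

print(dict(counts_all))
print(dict(counts_played))
print(min_all, min_played, avg_played)
```

Without the filter, every song appears to have one rating per user. With it, the counts reflect actual plays. A song that nobody has played (songId 12 here) drops out of the grouped result entirely, so it does not contribute a count of 0 to the minimum or the average.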

Instructions

  • As an example, the .filter(), .groupBy(), and .count() methods are applied to the msd dataset, along with .select() and min(), to return the smallest number of implicit ratings that any song in the dataset has received. Use this as a model to calculate the avg() number of implicit ratings that the songs in msd have received.
  • Using the same model, find the min() and avg() number of implicit ratings that each userId has provided in the msd dataset.