Grouped summary statistics
In this exercise, we are going to combine the .groupBy() and .filter() methods that you've used previously to calculate the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated.
Because our data now includes 0s for items not yet consumed, we'll need to .filter() them out when computing grouped summary statistics like these. The msd dataset is provided for you, and the col(), min(), and avg() functions from pyspark.sql.functions have already been imported.
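To see why the filter step matters, here is a minimal pure-Python sketch of the same logic on a hypothetical toy play-count table (the data and names are made up for illustration, not taken from the msd dataset): rows with num_plays equal to 0 are dropped first, then ratings are counted per song, and min and average are taken over those counts.

```python
from collections import defaultdict

# Toy implicit-ratings table: (userId, songId, num_plays).
# Zeros mark songs a user has not played yet (hypothetical sample data).
rows = [
    ("u1", "s1", 3), ("u1", "s2", 0), ("u1", "s3", 1),
    ("u2", "s1", 5), ("u2", "s2", 2), ("u2", "s3", 0),
    ("u3", "s1", 0), ("u3", "s2", 4), ("u3", "s3", 0),
]

# Equivalent of .filter(col("num_plays") > 0): drop unconsumed items.
played = [(u, s, n) for (u, s, n) in rows if n > 0]

# Equivalent of .groupBy("songId").count(): ratings received per song.
per_song = defaultdict(int)
for _, song, _ in played:
    per_song[song] += 1

counts = list(per_song.values())
print("Minimum implicit ratings for a song:", min(counts))
print("Average implicit ratings per song:", sum(counts) / len(counts))
```

Without the filter, every song would appear to have one "rating" per user in the table, so the grouped min and average would be inflated by the zero rows.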
This exercise is part of the course
Building Recommendation Engines with PySpark
Exercise instructions
- As an example, the .filter(), .groupBy(), and .count() methods are applied to the msd dataset along with .select() and min() to return the smallest number of ratings that any song in the dataset has received. Use this as a model to calculate the avg() number of implicit ratings the songs in msd have received.
- Using the same model, find the min() and avg() number of implicit ratings that userIds have provided in the msd dataset.
Hands-on interactive exercise
Give this exercise a try by completing the sample code below.
# Min num implicit ratings for a song
print("Minimum implicit ratings for a song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(min("count")).show()
# Avg num implicit ratings per song
print("Average implicit ratings per song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(avg("count")).show()
# Min num implicit ratings from a user
print("Minimum implicit ratings from a user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(min("count")).show()
# Avg num implicit ratings for users
print("Average implicit ratings per user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(avg("count")).show()