Grouped summary statistics
In this exercise, we are going to combine the .groupBy() and .filter() methods that you've used previously to calculate the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated.
Because our data now includes 0's for items not yet consumed, we'll need to .filter() them out when doing grouped summary statistics like this. The msd dataset is provided for you here. The col(), min(), and avg() functions from pyspark.sql.functions have been imported for you.
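To make the idea concrete, here is a minimal pure-Python sketch (not Spark) of what the filter-then-group-then-count chain computes. The toy rows and values below are illustrative assumptions, not the real msd data.

```python
from collections import defaultdict

# Toy implicit-feedback data: (userId, songId, num_plays).
# num_plays of 0 means the user never played the song.
rows = [
    ("u1", "s1", 3), ("u1", "s2", 0), ("u1", "s3", 1),
    ("u2", "s1", 5), ("u2", "s2", 2), ("u2", "s3", 0),
    ("u3", "s1", 0), ("u3", "s2", 0), ("u3", "s3", 4),
]

# Keep only rows representing actual listens (num_plays > 0),
# mirroring msd.filter(col("num_plays") > 0).
played = [r for r in rows if r[2] > 0]

# Group by songId and count implicit ratings per song,
# mirroring .groupBy("songId").count().
ratings_per_song = defaultdict(int)
for user, song, plays in played:
    ratings_per_song[song] += 1

counts = list(ratings_per_song.values())
print("Minimum implicit ratings for a song:", min(counts))
print("Average implicit ratings per song:", sum(counts) / len(counts))
```

Without the filter, every song would appear to have a rating from every user (the zeros would be counted), which is why the zeros must be dropped before grouping.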
This exercise is part of the course Building Recommendation Engines with PySpark
Instructions
- As an example, the .filter(), .groupBy(), and .count() methods are applied to the msd dataset along with .select() and min() to return the smallest number of ratings that any song in the dataset has received. Use this as a model to calculate the avg() number of implicit ratings the songs in msd have received.
- Using the same model, find the min() and avg() number of implicit ratings that userIds have provided in the msd dataset.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Min num implicit ratings for a song
print("Minimum implicit ratings for a song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(min("count")).show()
# Avg num implicit ratings per song
print("Average implicit ratings per song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(avg("count")).show()
# Min num implicit ratings from a user
print("Minimum implicit ratings from a user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(min("count")).show()
# Avg num implicit ratings for users
print("Average implicit ratings per user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(avg("count")).show()