MovieLens Summary Statistics
Let's take the .groupBy() method a bit further. Once you've applied the .groupBy() method to a dataframe, you can subsequently run aggregate functions such as .sum(), .avg(), and .min() and have the results grouped. This exercise will walk you through how this is done. The min and avg functions have been imported from pyspark.sql.functions for you.
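To make the two steps concrete (count the rows in each group, then summarize those counts), here is a minimal plain-Python sketch of the same logic with no Spark involved. The toy_ratings list and the count_summary helper are hypothetical names invented for illustration; they are not part of the exercise or of the MovieLens data.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: (userId, movieId, rating) tuples, not the real MovieLens set.
toy_ratings = [
    (1, 10, 4.0),
    (1, 20, 3.5),
    (2, 10, 5.0),
    (3, 10, 2.0),
    (3, 20, 4.5),
    (3, 30, 1.0),
]


def count_summary(rows, key_index):
    """Count rows per key, then return (min, avg) of those per-key counts."""
    # Step 1: one count per distinct key, analogous to groupBy(key).count().
    counts = Counter(row[key_index] for row in rows)
    # Step 2: summarize the counts, analogous to selecting min("count") and avg("count").
    return min(counts.values()), mean(counts.values())


# Ratings per movie (key is the movieId position in each tuple).
movie_min, movie_avg = count_summary(toy_ratings, key_index=1)
# Ratings per user (key is the userId position in each tuple).
user_min, user_avg = count_summary(toy_ratings, key_index=0)

print("Fewest ratings for any movie:", movie_min)
print("Avg num ratings per movie:", movie_avg)
print("Fewest ratings for any user:", user_min)
print("Avg num ratings per user:", user_avg)
```

The point of the sketch is that min and avg are applied to the per-group counts, not to the ratings themselves. In PySpark, that is why the aggregate is computed over the "count" column produced by .count().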
This exercise is part of the course
Building Recommendation Engines with PySpark
Instructions
- Group the data by movieId and use the .count() method to calculate how many ratings each movie has received. From there, call the .select() method to select the following metrics:
  - min("count") to get the smallest number of ratings that any movie in the dataset has received. This first one is given to you as an example.
  - avg("count") to get the average number of ratings per movie.
- Do the same thing, but this time group by userId to get the min and avg number of ratings per user.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Min num ratings for movies
print("Movie with the fewest ratings: ")
ratings.groupBy("movieId").count().select(min("count")).show()
# Avg num ratings per movie
print("Avg num ratings per movie: ")
____.groupBy("____").count().____(avg("____")).____()
# Min num ratings for user
print("User with the fewest ratings: ")
ratings.____("userId").____().select(____("____")).____()
# Avg num ratings per user
print("Avg num ratings per user: ")
____.____("____").____().____(____("____")).____()