Exercise

MovieLens Summary Statistics

Let's take the groupBy() method a bit further.

Once you've applied the .groupBy() method to a DataFrame, you can run aggregate functions such as .sum(), .avg(), and .min() on the result, and each aggregate is computed separately for every group. This exercise walks you through how this is done. The min and avg functions have been imported from pyspark.sql.functions for you.

Instructions

  • Group the data by movieId and use the .count() method to calculate how many ratings each movie has received. From there, call the .select() method to select the following metrics:
    • min("count") to get the smallest number of ratings that any movie in the dataset has received. This first one is given to you as an example.
    • avg("count") to get the average number of ratings per movie
  • Do the same thing, but this time group by userId to get the min and avg number of ratings per user.