The GroupBy and Filter methods
Now that we know a little more about the dataset, let's look at some general summary metrics of the ratings dataset: how many ratings each movie has received and how many ratings each user has provided.
Two common methods that will be helpful to you as you aggregate summary statistics in Spark are the .filter() and .groupBy() methods. The .filter() method allows you to filter out any data that doesn't meet your specified criteria.
This exercise is part of the course
Building Recommendation Engines with PySpark
Instructions
- Import col from pyspark.sql.functions, and view the ratings dataset using the .show() method.
- Apply the .filter() method on the ratings dataset with the following filter inside the parentheses in order to include only userIds less than 100: col("userId") < 100.
- Call the .groupBy() method on the ratings dataset to group the data by userId. Call the .count() method to see how many movies each userId has rated.
Interactive hands-on exercise
Try this exercise by completing the sample code below.
# Import the requisite packages
from pyspark.sql.____ import ____
# View the ratings dataset
____.____()
# Filter to show only userIds less than 100
ratings.____(col("____") < ____).____()
# Group data by userId, count ratings
ratings.____("____").count().show()