
The GroupBy and Filter methods

Now that we know a little more about the dataset, let's look at some general summary metrics of the ratings dataset and see how many ratings the movies have and how many ratings each user has provided.

Two common methods that will be helpful to you as you aggregate summary statistics in Spark are the .filter() and the .groupBy() methods. The .filter() method allows you to filter out any rows that don't meet your specified criteria, while .groupBy() groups rows that share a value in a given column so you can aggregate over each group.

This exercise is part of the course

Building Recommendation Engines with PySpark


Exercise instructions

  • Import col from pyspark.sql.functions, and view the ratings dataset using .show().
  • Apply the .filter() method on the ratings dataset with the following filter inside the parentheses in order to include only userId's less than 100: col("userId") < 100.
  • Call the .groupBy() method on the ratings dataset to group the data by userId. Call the .count() method to see how many movies each userId has rated.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Import the requisite packages
from pyspark.sql.____ import ____

# View the ratings dataset
____.____()

# Filter to show only userIds less than 100
ratings.____(col("____") < ____).____()

# Group data by userId, count ratings
ratings.____("____").count().show()