Joining with ratings

In the video exercise, you saw how to use transformations in PySpark by joining the film and ratings tables to create a new column that stores the average rating per customer. In this exercise, you'll apply the same techniques to the film and ratings tables to calculate the average rating for every film.

The PySpark DataFrame with films, film_df, and the PySpark DataFrame with ratings, rating_df, are available in your workspace.

This exercise is part of the course

Introduction to Data Engineering

Exercise instructions

  • Take the mean rating per film_id, and assign the result to ratings_per_film_df.
  • Complete the .join() statement to join on the film_id column.
  • Show the first 5 results of the resulting DataFrame.
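
To make the three steps concrete, here is a minimal plain-Python sketch of what a group-by-mean followed by an inner join computes. It is not PySpark and not the exercise solution. The sample rows, the helper functions mean_rating_per_film and inner_join_on_film_id, and the avg_rating column name are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real film_df and rating_df hold their own data.
films = [
    {"film_id": 1, "title": "Film A"},
    {"film_id": 2, "title": "Film B"},
    {"film_id": 3, "title": "Film C"},  # has no ratings
]
ratings = [
    {"film_id": 1, "rating": 4},
    {"film_id": 1, "rating": 2},
    {"film_id": 2, "rating": 5},
]


def mean_rating_per_film(rating_rows):
    """Group ratings by film_id and take the mean of each group."""
    grouped = defaultdict(list)
    for row in rating_rows:
        grouped[row["film_id"]].append(row["rating"])
    return {film_id: mean(values) for film_id, values in grouped.items()}


def inner_join_on_film_id(film_rows, mean_by_film):
    """Keep only films with a matching film_id and attach their mean rating."""
    return [
        {**film, "avg_rating": mean_by_film[film["film_id"]]}
        for film in film_rows
        if film["film_id"] in mean_by_film
    ]


ratings_per_film = mean_rating_per_film(ratings)
films_with_ratings = inner_join_on_film_id(films, ratings_per_film)

# Show at most the first 5 joined rows.
for row in films_with_ratings[:5]:
    print(row)
```

In this sketch, the film without any ratings drops out of the result, which is how an inner join behaves. An inner join is also the default for PySpark's .join() when no join type is given.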

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Use groupBy and mean to aggregate the column
ratings_per_film_df = rating_df.____('____').____('____')

# Join the tables using the film_id column
film_df_with_ratings = film_df.join(
    ratings_per_film_df,
    film_df.film_id==____
)

# Show the first 5 results
print(film_df_with_ratings.____)