Joining with ratings
In the video exercise, you saw how to use transformations in PySpark by joining the film and ratings tables to create a new column that stores the average rating per customer.
In this exercise, you'll combine the film and ratings tables again, using the same techniques you learned in the video exercise, to calculate the average rating for every film.
The PySpark DataFrame with films, film_df, and the PySpark DataFrame with ratings, rating_df, are available in your workspace.
This exercise is part of the course Introduction to Data Engineering.
Exercise instructions
- Take the mean rating per film_id, and assign the result to ratings_per_film_df.
- Complete the .join() statement to join on the film_id column.
- Show the first 5 results of the resulting DataFrame.
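To picture what these steps compute, here is a minimal sketch in plain Python rather than PySpark. It is not the exercise solution. The sample rows, the title column, the avg_rating key, and both helper functions are assumptions made up for illustration, since the real contents of film_df and rating_df are not shown here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real film_df and rating_df are not shown.
films = [
    {"film_id": 1, "title": "Film A"},
    {"film_id": 2, "title": "Film B"},
    {"film_id": 3, "title": "Film C"},  # this film has no ratings
]
ratings = [
    {"film_id": 1, "rating": 4},
    {"film_id": 1, "rating": 2},
    {"film_id": 2, "rating": 5},
]


def mean_rating_per_film(rating_rows):
    """Group ratings by film_id, then take the mean of each group."""
    grouped = defaultdict(list)
    for row in rating_rows:
        grouped[row["film_id"]].append(row["rating"])
    return {film_id: mean(values) for film_id, values in grouped.items()}


def inner_join_on_film_id(film_rows, avg_by_film):
    """Keep only films that have a matching film_id in the averages."""
    return [
        {**film, "avg_rating": avg_by_film[film["film_id"]]}
        for film in film_rows
        if film["film_id"] in avg_by_film
    ]


ratings_per_film = mean_rating_per_film(ratings)
films_with_ratings = inner_join_on_film_id(films, ratings_per_film)

# Show at most the first 5 joined rows.
for row in films_with_ratings[:5]:
    print(row)
```

The sketch drops the film without ratings because it mimics an inner join, which is also the default join type of PySpark's .join().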
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Use groupBy and mean to aggregate the column
ratings_per_film_df = rating_df.____('____').____('____')
# Join the tables using the film_id column
film_df_with_ratings = film_df.join(
ratings_per_film_df,
film_df.film_id==____
)
# Show the first 5 results
print(film_df_with_ratings.____)