Assigning integer IDs to movies
Let's do the same thing to the movies. Then let's join the new user IDs and movie IDs into one dataframe.
This exercise is part of the course
Building Recommendation Engines with PySpark
Instructions
- Use the .select() and the .distinct() methods to extract all unique Movie values from the ratings dataframe.
- Repartition the movies dataframe to one partition using coalesce().
- Complete the partial code provided to assign unique integer IDs to each movie. Name the new column movieId and call the .persist() method on the resulting dataframe.
- Join the ratings dataframe to the users dataframe and subsequently to the movies dataframe. Call the result movie_ratings.
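Why coalesce to a single partition before assigning IDs? Spark's documentation describes the current implementation of monotonically_increasing_id() as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below imitates that layout in plain Python so it runs without a Spark session. It is an illustration only, not Spark code; the function name emulate_monotonic_ids and the movie titles are made up for the example.

```python
# Plain-Python emulation of the documented ID layout; this is not Spark code.
PARTITION_SHIFT = 33  # the lower 33 bits hold the row number within a partition


def emulate_monotonic_ids(partitions):
    """Return one list of IDs per partition, mimicking the documented bit layout."""
    return [
        [(partition_index << PARTITION_SHIFT) + row_index
         for row_index in range(len(rows))]
        for partition_index, rows in enumerate(partitions)
    ]


movies = ["Alien", "Brazil", "Casablanca", "Dune"]  # hypothetical titles

# Split across two partitions: the second partition's IDs jump by 2**33.
two_partitions = emulate_monotonic_ids([movies[:2], movies[2:]])

# Everything in one partition: the IDs are consecutive, starting at 0.
one_partition = emulate_monotonic_ids([movies])

print(two_partitions)  # [[0, 1], [8589934592, 8589934593]]
print(one_partition)   # [[0, 1, 2, 3]]
```

With several partitions the IDs stay unique and increasing but leave large gaps, which is why the exercise asks for one partition first.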
Hands-on interactive exercise
Try this exercise by completing this sample code.
from pyspark.sql.functions import monotonically_increasing_id

# Extract the distinct movie IDs
movies = ratings.select("____").distinct()
# Repartition the data to have only one partition.
movies = movies.coalesce(____)
# Create a new column of movieId integers.
movies = movies.withColumn("____", monotonically_increasing_id()).____()
# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(____, "User", "left").join(____, "Movie", "left")
movie_ratings.show()
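
To see what the two left joins are meant to produce, here is a rough plain-Python stand-in for the same steps: collect distinct values, give each an integer ID, then attach both IDs to every rating. It does not use PySpark and does not fill in the blanks above. The sample ratings, the helper assign_ids, and the userId column name are assumptions made for this illustration. Numbering in first-seen order is also a simplification, since .distinct() in Spark does not promise any row order.

```python
# Hypothetical sample data; the names and ratings are made up for illustration.
ratings = [
    {"User": "ann", "Movie": "Alien", "Rating": 4},
    {"User": "bob", "Movie": "Brazil", "Rating": 5},
    {"User": "ann", "Movie": "Brazil", "Rating": 3},
]


def assign_ids(rows, key):
    """Map each distinct value of `key` to an integer, in first-seen order."""
    distinct = dict.fromkeys(row[key] for row in rows)  # keeps insertion order
    return {value: index for index, value in enumerate(distinct)}


user_ids = assign_ids(ratings, "User")    # "userId" below is an assumed name
movie_ids = assign_ids(ratings, "Movie")

# Left-join behaviour: every rating row is kept; a missing key would give None.
movie_ratings = [
    {**row,
     "userId": user_ids.get(row["User"]),
     "movieId": movie_ids.get(row["Movie"])}
    for row in ratings
]

for row in movie_ratings:
    print(row)
```

Each output row keeps the original User, Movie and Rating fields and gains the two integer ID fields, which is the shape the movie_ratings dataframe should have.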