Assigning integer IDs to movies
Let's do the same thing to the movies. Then let's join the new user IDs and movie IDs into one dataframe.
This exercise is part of the course
Building Recommendation Engines with PySpark
Exercise instructions
- Use the .select() and the .distinct() methods to extract all unique Movies from the ratings dataframe.
- Repartition the movies dataframe to one partition using coalesce().
- Complete the partial code provided to assign unique integer IDs to each movie. Name the new column movieId and call the .persist() method on the resulting dataframe.
- Join the ratings dataframe to the users dataframe and subsequently to the movies dataframe. Call the result movie_ratings.
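Why repartition to a single partition before assigning IDs? The PySpark documentation describes monotonically_increasing_id() as placing the partition ID in the upper 31 bits of the result and the record number within the partition in the lower 33 bits. The sketch below emulates that layout in plain Python (it does not call Spark); the function name emulated_ids and the movie titles are hypothetical, chosen only for illustration.

```python
# Emulates the documented bit layout of monotonically_increasing_id():
# partition index in the upper 31 bits, row position within the partition
# in the lower 33 bits. Plain Python, not a Spark call.

def emulated_ids(partitions):
    """Return (value, id) pairs for a list of partitions (lists of values)."""
    pairs = []
    for partition_index, rows in enumerate(partitions):
        for row_index, value in enumerate(rows):
            pairs.append((value, (partition_index << 33) + row_index))
    return pairs

movies = ["Alien", "Brazil", "Casablanca", "Dune"]  # hypothetical titles

# Two partitions: IDs jump by 2**33 at the partition boundary.
two_partitions = emulated_ids([movies[:2], movies[2:]])

# One partition (the situation after coalesce(1)): IDs are consecutive from 0.
one_partition = emulated_ids([movies])

print(two_partitions)
print(one_partition)
```

With more than one partition the IDs stay unique and increasing but leave large gaps, which is why the exercise collapses the dataframe to one partition first.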
Hands-on interactive exercise
Try this exercise by completing this sample code.
from pyspark.sql.functions import monotonically_increasing_id

# Extract the distinct movie IDs
movies = ratings.select("____").distinct()
# Repartition the data to have only one partition.
movies = movies.coalesce(____)
# Create a new column of movieId integers.
movies = movies.withColumn("____", monotonically_increasing_id()).____()
# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(____, "User", "left").join(____, "Movie", "left")
movie_ratings.show()
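To see what shape the joined result should take, here is a rough plain-Python analogue of the two left joins on toy data. It is only a sketch: the rows, the Rating column, and the helper names assign_ids and left_join are assumptions made for illustration, and sorting the distinct values is a convenience here (Spark's .distinct() does not guarantee any order).

```python
# Toy stand-in for the ratings dataframe (all values are made up).
ratings = [
    {"User": "ann", "Movie": "Alien", "Rating": 4},
    {"User": "bob", "Movie": "Alien", "Rating": 5},
    {"User": "ann", "Movie": "Dune", "Rating": 3},
]

def assign_ids(rows, key):
    """Map each distinct value of `key` to a consecutive integer ID."""
    distinct = sorted({row[key] for row in rows})  # sorted only for determinism
    return {value: i for i, value in enumerate(distinct)}

def left_join(rows, lookup, key, id_name):
    """Keep every left-hand row; attach the matching ID, or None if absent."""
    return [{**row, id_name: lookup.get(row[key])} for row in rows]

users = assign_ids(ratings, "User")
movies = assign_ids(ratings, "Movie")

# Join ratings to users, then to movies, mirroring the chained .join() calls.
movie_ratings = left_join(left_join(ratings, users, "User", "userId"),
                          movies, "Movie", "movieId")

for row in movie_ratings:
    print(row)
```

Each output row keeps the original User and Movie strings and gains the integer userId and movieId columns, which is the layout the exercise's movie_ratings dataframe is meant to have.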