Correct format and distinct users
Take a look at the R dataframe. Notice that it is in conventional or "wide" format, with a different movie in each column. Also notice that the User values and the movie names are not in integer format. Follow the steps to properly prepare this data for ALS.
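To make the difference between "wide" and "long" format concrete, here is a minimal plain-Python sketch of the reshape. It is not the course's to_long() function: the sample names and ratings, the wide_to_long helper, and the choice to drop missing ratings are all assumptions made for illustration.

```python
# Hypothetical wide-format data: one row per user, one column per movie.
# None marks a movie the user has not rated.
wide_rows = [
    {"User": "Ana Lopez", "Coco": 4, "Shrek": 3, "Sneakers": None},
    {"User": "Ben Okoro", "Coco": None, "Shrek": 5, "Sneakers": 2},
]


def wide_to_long(rows, id_col="User"):
    """Turn one row per user into one row per (user, movie, rating).

    Assumption for this sketch: missing ratings are dropped.
    """
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col or value is None:
                continue
            long_rows.append((row[id_col], column, value))
    return long_rows


long_rows = wide_to_long(wide_rows)
for user, movie, rating in long_rows:
    print(user, movie, rating)
```

In the long layout every row holds exactly one rating, so the table grows downward as ratings are added instead of gaining a new column for each movie.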
This exercise is part of the course
Building Recommendation Engines with PySpark
Exercise instructions
- Import the monotonically_increasing_id function from pyspark.sql.functions and view the R dataframe using the .show() method.
- Use the to_long() function to convert the R dataframe into a "long" dataframe. Call the new dataframe ratings.
- Create a dataframe called users that contains all the .distinct() users from the ratings dataframe, and repartition it into one partition using the .coalesce(1) method.
- Use the monotonically_increasing_id() function inside withColumn() to create a new column in the users dataframe that contains a unique integer for each user. Call this column userId. Be sure to call the .persist() method on the final dataframe to ensure the new integer IDs persist.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import monotonically_increasing_id and show R
from pyspark.sql.functions import ____
R.show()
# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(____)
ratings.show()
# Get unique users and repartition to 1 partition
users = ratings.select("____").____().____()
# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("____", monotonically_increasing_id()).____()
users.show()