Correct format and distinct users

Take a look at the R dataframe. Notice that it is in conventional or "wide" format, with a different movie in each column. Also notice that the user names and movie names are not integers. Follow the steps below to prepare this data for ALS.
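
To make the difference between the two layouts concrete, here is a minimal plain-Python sketch of a wide-to-long conversion. It is not the course's to_long() function, whose implementation is not shown here. The user names, movie titles, the "User" column name, and the choice to drop missing ratings are all assumptions made for illustration.

```python
# Hypothetical wide-format data: one row per user, one column per movie.
# None marks a movie the user has not rated.
wide_rows = [
    {"User": "Ana", "Movie A": 3, "Movie B": 4, "Movie C": None},
    {"User": "Ben", "Movie A": None, "Movie B": 5, "Movie C": 2},
]


def wide_to_long(rows, id_col="User"):
    """Return one (user, movie, rating) tuple per rated movie."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            # Skip the identifier column and unrated movies (an assumption).
            if column == id_col or value is None:
                continue
            long_rows.append((row[id_col], column, value))
    return long_rows


long_rows = wide_to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each row of the long layout holds a single user, movie, and rating, which is the shape the later steps build on.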

This exercise is part of the course

Building Recommendation Engines with PySpark

Exercise instructions

  • Import the monotonically_increasing_id function from pyspark.sql.functions and view the R dataframe using the .show() method.
  • Use the to_long() function to convert the R dataframe into a "long" data frame. Call the new dataframe ratings.
  • Create a dataframe called users that contains all the .distinct() users from the ratings dataframe, and reduce it to a single partition using the .coalesce(1) method.
  • Use the monotonically_increasing_id() function inside of withColumn() to create a new column in the users dataframe that contains a unique integer for each user. Call this column userId. Be sure to call the .persist() method on the final dataframe so that the generated IDs are cached rather than recomputed, and possibly reassigned, each time the dataframe is used.
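
The reason for coalescing to one partition can be sketched in plain Python. Spark's documentation describes the current implementation of monotonically_increasing_id() as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The simulation below mimics that layout only; it does not call Spark, and the helper names and user names are hypothetical.

```python
def simulated_id(partition_id, row_in_partition):
    """Mimic the documented bit layout: partition ID above the low 33 bits."""
    return (partition_id << 33) + row_in_partition


def assign_ids(partitions):
    """Assign an ID to every name, given a list of partitions of names."""
    return [
        (name, simulated_id(partition_id, row_number))
        for partition_id, partition in enumerate(partitions)
        for row_number, name in enumerate(partition)
    ]


# Hypothetical users split across two partitions: IDs jump between partitions.
two_partitions = assign_ids([["Ana", "Ben"], ["Cal", "Dee"]])

# The same users in a single partition: IDs are consecutive from 0.
one_partition = assign_ids([["Ana", "Ben", "Cal", "Dee"]])

print(two_partitions)
print(one_partition)
```

With two partitions the third user receives an ID of 2**33 rather than 2, so the IDs are unique but far from consecutive. With a single partition they run 0, 1, 2, 3, which keeps them small and easy to read.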

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import monotonically_increasing_id and show R
from pyspark.sql.functions import ____
R.show()

# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(____)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("____").____().____()

# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("____", monotonically_increasing_id()).____()
users.show()