Correct format and distinct users

Take a look at the R dataframe. Notice that it is in conventional or "wide" format, with a different movie in each column. Also notice that the user names and movie names are not in integer format, which ALS requires for its user and item columns. Follow the steps to properly prepare this data for ALS.
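To make the wide-versus-long distinction concrete, here is a minimal plain-Python sketch. It is not the course's to_long() helper, whose implementation is not shown on this page. The function name wide_to_long, the "User" column name, the sample names and ratings, and the choice to skip unrated movies are all assumptions made for illustration.

```python
# Hypothetical wide-format ratings: one row per user, one column per movie.
# None marks a movie the user has not rated. All values are invented.
wide_rows = [
    {"User": "Ana", "Movie A": 4, "Movie B": None, "Movie C": 5},
    {"User": "Ben", "Movie A": None, "Movie B": 3, "Movie C": 2},
]


def wide_to_long(rows, id_column="User"):
    """Return one (user, movie, rating) tuple per rated movie."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            # Skip the identifier column itself and any unrated movie
            # (dropping missing ratings is an assumption of this sketch).
            if column == id_column or value is None:
                continue
            long_rows.append((row[id_column], column, value))
    return long_rows


long_rows = wide_to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each movie column becomes a value in a single "movie" field, so the long form has one row per user and movie pair instead of one row per user.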

This exercise is part of the course

Building Recommendation Engines with PySpark

Instructions

  • Import the monotonically_increasing_id function from pyspark.sql.functions and view the R dataframe using the .show() method.
  • Use the to_long() function to convert the R dataframe into a "long" data frame. Call the new dataframe ratings.
  • Create a dataframe called users that contains all the .distinct() users from the ratings dataframe, and reduce it to a single partition using the .coalesce(1) method.
  • Use the monotonically_increasing_id() function inside withColumn() to create a new column in the users dataframe that contains a unique integer for each user. Call this column userId. Be sure to call the .persist() method on the final dataframe so that the generated IDs are cached rather than recomputed, which keeps them stable.
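
The reason for moving to a single partition before generating IDs can be sketched in plain Python. The PySpark documentation describes the current implementation of monotonically_increasing_id() as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The helper below, simulated_ids, is a hypothetical stand-in that mimics that layout. It does not call Spark, and the partition sizes are invented.

```python
def simulated_ids(partition_sizes):
    """Mimic the documented ID layout: partition index in the upper bits,
    row position within the partition in the lower 33 bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            ids.append((partition_index << 33) + row_in_partition)
    return ids


# Four users held in a single partition: IDs are consecutive.
one_partition = simulated_ids([4])

# The same four users split across two partitions: IDs are still unique
# and increasing, but there is a large gap at the partition boundary.
two_partitions = simulated_ids([2, 2])

print(one_partition)   # [0, 1, 2, 3]
print(two_partitions)  # [0, 1, 8589934592, 8589934593]
```

Under this layout, a single partition gives small consecutive IDs, while several partitions give unique but widely spaced ones.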

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Import monotonically_increasing_id and show R
from pyspark.sql.functions import ____
R.show()

# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(____)
ratings.show()

# Get unique users and repartition to 1 partition
users = ratings.select("____").____().____()

# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("____", monotonically_increasing_id()).____()
users.show()