Get startedGet started for free

Loading Movie Lens dataset into RDDs

Collaborative filtering is a technique for recommender systems wherein users' ratings and interactions with various products are used to recommend new ones. With the advent of Machine Learning and parallelized processing of data, recommender systems have become widely popular in recent years, and are utilized in a variety of areas including movies, music, news, books, research articles, search queries, social tags. In this 3-part exercise, your goal is to develop a simple movie recommendation system using PySpark MLlib using a subset of MovieLens 100k dataset.

In the first part, you'll first load the MovieLens data (ratings.csv) into RDD and from each line in the RDD which is formatted as userId,movieId,rating,timestamp, you'll need to map the MovieLens data to a Ratings object (userID, productID, rating) after removing timestamp column and finally you'll split the RDD into training and test RDDs.

Remember, you have a SparkContext sc available in your workspace. Also file_path variable (which is the path to the ratings.csv file), and ALS class (i.e. Rating) are already available in your workspace.

This exercise is part of the course

Big Data Fundamentals with PySpark

View Course

Exercise instructions

  • Load the ratings.csv dataset into an RDD.
  • Split the RDD using , as a delimiter.
  • For each line of the RDD, using Rating() class create a tuple of userID, productID, rating.
  • Randomly split the data into training data and test data (0.8 and 0.2).

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the data into RDD
data = sc.____(file_path)

# Split the RDD 
ratings = data.____(lambda l: l.split('____'))

# Transform the ratings RDD 
ratings_final = ratings.____(lambda line: Rating(int(line[0]), int(____), float(____)))

# Split the data into training and test
training_data, test_data = ratings_final.____([0.8, 0.2])
Edit and Run Code