Loading the MovieLens dataset into RDDs
Collaborative filtering is a technique for recommender systems in which users' ratings of, and interactions with, various products are used to recommend new ones. With the advent of machine learning and parallelized data processing, recommender systems have become widely popular in recent years and are used in a variety of areas, including movies, music, news, books, research articles, search queries, and social tags. In this 3-part exercise, your goal is to develop a simple movie recommendation system with PySpark MLlib, using a subset of the MovieLens 100k dataset.
In the first part, you'll load the MovieLens data (ratings.csv) into an RDD. Each line in the RDD is formatted as userId,movieId,rating,timestamp. You'll map each line to a Rating object (userID, productID, rating), dropping the timestamp column, and finally split the RDD into training and test RDDs.
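The per-line mapping described above can be sketched in plain Python, without Spark. This is only an illustration: the namedtuple below is a stand-in for the Rating class in pyspark.mllib.recommendation (which has the same three fields), and parse_rating and the sample line are hypothetical names and data, not part of the exercise.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating so the snippet runs
# without Spark; the real class exposes the same user/product/rating fields.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Turn 'userId,movieId,rating,timestamp' into a Rating, dropping the timestamp."""
    fields = line.split(",")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))


# Hypothetical sample line, not taken from the actual dataset
sample = "1,31,2.5,1260759144"
parsed = parse_rating(sample)
print(parsed)
```

The IDs are cast to int and the rating to float because every field arrives as a string after splitting the line.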
Remember, you have a SparkContext sc available in your workspace. The file_path variable (the path to the ratings.csv file) and the Rating class from the pyspark.mllib.recommendation module (the module that also provides ALS) are already available in your workspace as well.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Load the ratings.csv dataset into an RDD.
- Split the RDD using a comma (,) as the delimiter.
- For each line of the RDD, use the Rating() class to create a tuple of userID, productID, rating.
- Randomly split the data into training data and test data (0.8 and 0.2).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the data into RDD
data = sc.____(file_path)
# Split the RDD
ratings = data.____(lambda l: l.split('____'))
# Transform the ratings RDD
ratings_final = ratings.____(lambda line: Rating(int(line[0]), int(____), float(____)))
# Split the data into training and test
training_data, test_data = ratings_final.____([0.8, 0.2])