
Collaborative filtering

1. Introduction to Collaborative filtering

In the previous video, you were introduced to the three C's of machine learning. In this video, we will start with the first C, which is collaborative filtering, and gain a basic understanding of recommender systems in Spark. Let's get started.

2. What is Collaborative filtering?

Collaborative filtering is a method of making automatic predictions about a user's interests by collecting preference or taste information from many users. It is one of the most commonly used algorithms in recommender systems. Collaborative filtering has two approaches: the User-User approach and the Item-Item approach. The User-User approach finds users similar to the target user and uses their ratings to make recommendations for the target user. The Item-Item approach finds and recommends items that are similar or related to items associated with the target user. Now let's take a look at the different components needed to build a recommendation system in PySpark.

3. Rating class in pyspark.mllib.recommendation submodule

The Rating class in the pyspark.mllib.recommendation submodule is a wrapper around the tuple (user, product, rating). The Rating class is very useful for parsing an RDD and creating tuples of user, product, and rating. Here is a simple example of creating an instance "r" of the Rating class with user equal to 1, product equal to 2, and rating equal to 5.0. Once the Rating instance is created, you can extract the user, product, and rating values by indexing into "r": in this example, r[0], r[1], and r[2] show the userId, productId, and rating for the "r" instance.
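A minimal sketch of that example (the printed values follow from the arguments passed to Rating):

from pyspark.mllib.recommendation import Rating

# Create a Rating instance for user 1, product 2, rating 5.0
r = Rating(1, 2, 5.0)

# Rating behaves like a tuple, so fields can be read by index...
print(r[0], r[1], r[2])             # 1 2 5.0
# ...or by name
print(r.user, r.product, r.rating)  # 1 2 5.0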

4. Splitting the data using randomSplit()

Splitting the data into training and testing sets is an integral part of machine learning. The training portion is used to train the model, while the testing portion is used to evaluate the model's performance. Typically, a larger portion of the data is assigned to training and a smaller portion to testing. PySpark's randomSplit() method randomly splits the data with the provided weights and returns multiple RDDs. In this example, we first create an RDD consisting of the numbers 1 to 10, and using randomSplit() we create two RDDs in a 60:40 ratio. The output of randomSplit() shows that the training RDD contains 6 elements, whereas the test RDD contains 4 elements.
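A short sketch of this split, assuming an active SparkContext named sc; the exact elements landing in each split depend on the random seed:

# An RDD of the numbers 1 to 10
data = sc.parallelize(range(1, 11))

# Split into roughly 60% training and 40% test;
# passing a seed makes the split reproducible
training, test = data.randomSplit([0.6, 0.4], seed=42)

print(training.collect())  # e.g. [1, 2, 4, 6, 7, 10]
print(test.collect())      # e.g. [3, 5, 8, 9]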

5. Alternating Least Squares (ALS)

The alternating least squares (ALS) algorithm available in spark.mllib helps find products that customers might like, based on their previous purchases or ratings. The ALS.train method requires an RDD of Rating objects, i.e., (userId, productId, rating) tuples, along with the training parameters rank and iterations. rank represents the number of latent features; iterations is the number of iterations of the least squares computation to run. Here is an example of running the ALS model. First, we create an RDD from a list of Rating objects and print out the contents of the RDD using the collect action. Next, we use ALS.train to train the model on the training data, as shown in the sketch below.
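A hedged sketch of this training step; the toy ratings and the rank and iterations values are illustrative choices, not prescribed settings:

from pyspark.mllib.recommendation import ALS, Rating

# A small RDD of Rating objects: (userId, productId, rating)
ratings = sc.parallelize([
    Rating(1, 1, 1.0),
    Rating(1, 2, 2.0),
    Rating(2, 1, 2.0),
    Rating(2, 2, 5.0),
])
print(ratings.collect())

# rank = number of latent features,
# iterations = number of least squares passes to run
model = ALS.train(ratings, rank=10, iterations=10)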

6. predictAll()

Once the model is trained, the next step is predicting ratings for user and product pairs. The predictAll method takes an RDD of (userId, productId) pairs and returns a prediction for each pair. To get the example to work, let's create an RDD from a list of tuples containing a userId and a productId using the SparkContext's parallelize method. Next, we apply the predictAll method to the unrated_RDD. Running the collect action on predictions shows a list of predicted ratings generated by the ALS model for userId 1 and productIds 1 and 2.
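Continuing the sketch above; the exact predicted values depend on the trained model, so the output shown is only indicative:

# (userId, productId) pairs we want predictions for
unrated_RDD = sc.parallelize([(1, 1), (1, 2)])

# predictAll returns an RDD of Rating objects holding the predicted ratings
predictions = model.predictAll(unrated_RDD)
print(predictions.collect())
# e.g. [Rating(user=1, product=1, rating=0.98...),
#       Rating(user=1, product=2, rating=1.92...)]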

7. Model evaluation

To evaluate a model trained using ALS, we can use the Mean Squared Error (MSE). The MSE measures the average of the squared errors between the predicted and the actual ratings. Continuing our previous example, we first reorganize both the ratings and the predictions so that each record is keyed by the (user, product) pair with the rating as its value. Next, we join the ratings RDD with the predictions RDD. Finally, we apply a squared-difference function in a map transformation on the rates_preds RDD and then use mean to get the MSE, as sketched below.
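A sketch of the evaluation, reusing the ratings and predictions RDDs from the examples above:

# Re-key both RDDs as ((user, product), rating) so they can be joined
rates = ratings.map(lambda r: ((r[0], r[1]), r[2]))
preds = predictions.map(lambda r: ((r[0], r[1]), r[2]))

# Join on (user, product): values become (actual rating, predicted rating)
rates_preds = rates.join(preds)

# MSE = mean of the squared differences between actual and predicted ratings
MSE = rates_preds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print(MSE)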

8. Let's practice!

Now it's your turn to try your hand at collaborative filtering in PySpark!