Model evaluation using MSE
After generating predicted ratings for the test data with the ALS model, in this final part of the exercise you'll prepare the data and calculate the Mean Squared Error (MSE) of the model. The MSE is the average value of (original rating - predicted rating)**2 over all rated (user, product) pairs and indicates the absolute fit of the model to the data; lower values mean a closer fit.
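As a quick illustration of the arithmetic only, here is a minimal plain-Python sketch of the MSE formula. The rating pairs are made-up values, not data from the exercise, and no Spark is involved.

```python
# Toy (original, predicted) rating pairs; the values are made up for illustration.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]

# Squared error for each pair: (original rating - predicted rating) ** 2
squared_errors = [(original - predicted) ** 2 for original, predicted in pairs]

# MSE is the mean of the squared errors.
mse = sum(squared_errors) / len(squared_errors)
print("MSE = {:.2f}".format(mse))
```

Because the errors are squared, a single large miss (here the third pair) contributes more to the MSE than several small ones.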
To compute the MSE, you'll first organize both the ratings_final and predictions RDDs into tuples of the form ((user, product), rating). In both RDDs the index mapping is:
0: user
1: product
2: rating
Then you'll join the transformed RDDs and finally apply a squared-difference function along with mean() to get the MSE.
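The key, join, and average steps described above can be sketched locally in plain Python. This is only an analogy under stated assumptions: the rows are hypothetical toy data, the helper key_by_user_product is an illustrative name, and dictionaries stand in for pair RDDs. It mirrors the shape of a joined record, (key, (original, predicted)), but it is not the Spark API and, unlike an RDD, a dictionary keeps only one rating per key.

```python
# Hypothetical toy rows in (user, product, rating) order,
# matching the index mapping 0: user, 1: product, 2: rating.
ratings_rows = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]
prediction_rows = [(1, 10, 3.5), (2, 10, 4.0), (2, 30, 2.0)]


def key_by_user_product(rows):
    """Turn (user, product, rating) rows into a {(user, product): rating} dict."""
    return {(r[0], r[1]): r[2] for r in rows}


rates = key_by_user_product(ratings_rows)
preds = key_by_user_product(prediction_rows)

# Inner join on the key: keep only (user, product) pairs present on both sides,
# producing (key, (original, predicted)) records.
joined = [(key, (rates[key], preds[key])) for key in rates if key in preds]

# Squared difference per joined record, then the mean.
squared_errors = [(pair[0] - pair[1]) ** 2 for _, pair in joined]
mse = sum(squared_errors) / len(squared_errors)
print("MSE = {:.3f}".format(mse))
```

Only the two keys that appear in both inputs, (1, 10) and (2, 10), survive the join, so ratings without a matching prediction do not affect the result.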
Remember, you have a SparkContext sc available in your workspace. The ratings_final and predictions RDDs are also already available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Organize the ratings_final RDD to make ((user, product), rating).
- Organize the predictions RDD to make ((user, product), rating).
- Join the predictions RDD with the ratings RDD.
- Evaluate the model using the MSE between the original rating and the predicted rating, and print it.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Prepare ratings data
rates = ratings_final.____(lambda r: ((r[0], r[1]), ____))
# Prepare predictions data
preds = predictions.____(lambda r: ((____, ____), ____))
# Join the ratings data with predictions data
rates_and_preds = rates.____(preds)
# Calculate and print MSE
MSE = rates_and_preds.____(lambda r: (r[1][0] - r[1]____)**2).mean()
print("Mean Squared Error of the model for the test data = {:.2f}".format(____))