Calculate sparsity
As you know, ALS works well with sparse datasets. Let's see how much of the ratings
matrix is actually empty.
Remember that sparsity is 1 minus the density of the matrix. To get it, divide the number of ratings present in the matrix by the total number of cells the matrix could hold (the number of users multiplied by the number of movies), then subtract that ratio from 1. The result is the share of the ratings matrix that is empty; multiplying it by 100 expresses it as a percentage.
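As a small worked illustration of this formula, the sketch below uses plain Python on a made-up list of (userId, movieId, rating) tuples instead of a Spark DataFrame. The data and the names toy_ratings, possible_cells, and toy_sparsity are hypothetical and are not part of the exercise.

```python
# Hypothetical toy ratings as (userId, movieId, rating) tuples.
# This is illustrative data, not the course's ratings dataset.
toy_ratings = [
    (1, 10, 4.0),
    (1, 20, 3.5),
    (2, 10, 5.0),
    (3, 30, 2.0),
]

# Numerator: number of cells that actually hold a rating
num_ratings = len(toy_ratings)

# Distinct users and distinct movies that appear in the data
num_users = len({user for user, _, _ in toy_ratings})
num_movies = len({movie for _, movie, _ in toy_ratings})

# Denominator: every cell the users-by-movies matrix could hold
possible_cells = num_users * num_movies

# Sparsity: 1 minus the filled share, expressed as a percentage
toy_sparsity = (1.0 - num_ratings / possible_cells) * 100

print("The toy ratings matrix is %.2f%% empty." % toy_sparsity)
# prints: The toy ratings matrix is 55.56% empty.
```

Here 4 ratings fill a 3 x 3 matrix, so 4/9 of the cells are occupied and the remaining 5/9 (about 55.56%) are empty.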
This exercise is part of the course
Building Recommendation Engines with PySpark
Exercise instructions
- Calculate the numerator of the sparsity metric by counting the total number of ratings that are contained in the ratings matrix.
- Calculate the number of distinct() userIds and distinct() movieIds in the ratings matrix.
- Calculate the denominator of the sparsity metric by multiplying the number of users by the number of movies in the ratings matrix.
- Calculate and print the sparsity by dividing the numerator by the denominator, subtracting from 1 and multiplying by 100. The numerator is multiplied by 1.0 to ensure the sparsity is returned as a decimal and not an integer.
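To see why the 1.0 factor is there, the sketch below compares integer (floor) division with float division on hypothetical counts. In Python 3 the / operator already performs true division, so the factor mainly guards against Python 2 behavior, where / between two integers floored the result; the // operator is used here to mimic that. The counts and the names floored and promoted are assumptions made for illustration.

```python
# Hypothetical counts, chosen only for illustration.
numerator = 4      # ratings present in the matrix
denominator = 9    # number of users * number of movies

# Floor division mimics what `/` did for two integers in Python 2:
# 4 // 9 is 0, so the matrix wrongly looks completely empty.
floored = (1.0 - numerator // denominator) * 100

# Multiplying by 1.0 first turns the numerator into a float,
# so the division keeps its fractional part.
promoted = (1.0 - (numerator * 1.0) / denominator) * 100

print("%.2f" % floored)   # prints 100.00
print("%.2f" % promoted)  # prints 55.56
```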
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Count the total number of ratings in the dataset
numerator = ____.select("____").count()
# Count the number of distinct userIds and distinct movieIds
num_users = ____.select("____").____().count()
num_movies = ____.select("____").____().count()
# Set the denominator equal to the number of users multiplied by the number of movies
denominator = ____ * ____
# Divide the numerator by the denominator
sparsity = (1.0 - (____ * 1.0) / ____) * 100
print("The ratings dataframe is", "%.2f" % sparsity + "% empty.")