Comparing all your movies at once
Finding the Jaccard similarity between two individual movies works well for small-scale analyses, but computing similarities one pair at a time can be too slow for making recommendations on larger datasets.
In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.
When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the pdist() (pairwise distance) function from scipy. pdist() returns the distances as a condensed one-dimensional array, which can be reshaped into the desired square matrix using squareform() from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.
movie_cross_table has once again been loaded for you.
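As a rough illustration of the steps described above, the sketch below runs pdist() and squareform() on a small made-up table. The toy_cross_table DataFrame and its movie and genre names are hypothetical stand-ins, not the course's movie_cross_table.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: rows are movies, columns are genre flags
toy_cross_table = pd.DataFrame(
    {
        "Action": [True, True, False, True],
        "Comedy": [False, False, True, True],
        "Drama": [True, True, False, False],
    },
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Condensed 1-D array with one Jaccard distance per pair of rows
toy_distances = pdist(toy_cross_table.values, metric="jaccard")

# Reshape to a square matrix and turn distances into similarities
toy_similarity_array = 1 - squareform(toy_distances)

# Label rows and columns with the movie names for lookup
toy_similarity_df = pd.DataFrame(
    toy_similarity_array,
    index=toy_cross_table.index,
    columns=toy_cross_table.index,
)
print(toy_similarity_df)
```

With four movies, pdist() produces six pairwise distances, and squareform() expands them into a 4 x 4 symmetric matrix. The diagonal becomes 1 after the subtraction because every movie is identical to itself.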
This exercise is part of the course
Building Recommendation Engines in Python
Exercise instructions
- Find the Jaccard distance measures between all movies and assign the results to jaccard_distances.
- Create a DataFrame from the jaccard_similarity_array with movie_cross_table.index as its rows and columns.
- Print the top 5 rows of the DataFrame and examine the similarity scores.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pandas and the distance functions from scipy
import pandas as pd
from scipy.spatial.distance import pdist, squareform
# Calculate all pairwise distances
jaccard_distances = ____(movie_cross_table.values, metric='____')
# Convert the distances to a square matrix
jaccard_similarity_array = 1 - ____(____)
# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.____(jaccard_similarity_array, ____=movie_cross_table.index, ____=movie_cross_table.index)
# Print the top 5 rows of the DataFrame
print(jaccard_similarity_df.head())
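Once the similarities are stored in a DataFrame, looking up the closest matches for one movie is a matter of selecting its row and sorting. The sketch below assumes a small hand-written similarity table; toy_similarity_df and its movie names are hypothetical, not values from the course dataset.

```python
import pandas as pd

# Hypothetical similarity table, symmetric with 1.0 on the diagonal
movies = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_similarity_df = pd.DataFrame(
    [
        [1.0, 1.0, 0.0, 1 / 3],
        [1.0, 1.0, 0.0, 1 / 3],
        [0.0, 0.0, 1.0, 0.5],
        [1 / 3, 1 / 3, 0.5, 1.0],
    ],
    index=movies,
    columns=movies,
)

# Select one movie's row, drop its self-similarity, and rank the rest
ranked = (
    toy_similarity_df.loc["Movie A"]
    .drop("Movie A")
    .sort_values(ascending=False)
)
print(ranked)
```

Dropping the movie's own entry avoids recommending the movie to itself, since its self-similarity is always 1.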