Comparing all your movies at once

While finding the Jaccard similarity between any two individual movies in your dataset is great for small-scale analyses, it can prove slow on larger datasets to make recommendations.

In this exercise, you will find the similarities between all movies and store them in a DataFrame for quick and easy lookup.

When finding the similarities between the rows in a DataFrame, you could run through all pairs and calculate them individually, but it's more efficient to use the pdist() (pairwise distance) function from scipy.

This can be reshaped into the desired rectangular shape using squareform() from the same library. Since you want similarity values as opposed to distances, you should subtract the values from 1.

movie_cross_table has once again been loaded for you.

This exercise is part of the course

Building Recommendation Engines in Python

View Course

Exercise instructions

Find the Jaccard distance measures between all movies and assign the results to jaccard_similarity_array.
Create a DataFrame from the jaccard_similarity_array with movie_genre_df.index as its rows and columns.
Print the top 5 rows of the DataFrame and examine the similarity scores.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances = ____(movie_cross_table.values, metric='____')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - ____(____)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.____(jaccard_similarity_array, ____=movie_cross_table.index, ____=movie_cross_table.index)

# Print the top 5 rows of the DataFrame
print(jaccard_similarity_df.head())

Edit and Run Code