
Making content-based recommendations

1. Making content-based recommendations

With our data formatted, we can begin making comparisons and recommendations, but to do so, we will need a way of calculating similarity between rows.

2. Introducing the Jaccard similarity

The metric we will use to measure similarity between items in our newly encoded dataset is called the Jaccard similarity. The Jaccard similarity is the number of attributes that two items have in common divided by the total number of their combined attributes. These are respectively shown by the two orange shaded areas in the Venn diagrams here. It will always be between 0 and 1, and the more attributes the two items have in common, the higher the score.
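To make the definition concrete, here is a minimal sketch that computes the ratio directly from two sets of attributes. The function name jaccard_similarity and the genre labels are invented for illustration and are not part of the course data.

```python
def jaccard_similarity(a, b):
    """Shared attributes divided by combined (union) attributes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical genre labels, for illustration only
book_a = {"fantasy", "adventure", "classic"}
book_b = {"fantasy", "adventure", "war"}

score = jaccard_similarity(book_a, book_b)
print(score)  # 2 shared / 4 combined = 0.5
```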

3. Calculating Jaccard similarity between books

We will continue working on the book genre DataFrame created in the last video called genres_array_df. This contains one row for each item (books in this case) and a column for each genre.

4. Calculating Jaccard similarity between books

To calculate the Jaccard similarity between the books in the DataFrame, we first need to import jaccard_score from sklearn.metrics. This function takes two vectors (rows in our case) and calculates the similarity value. So we can take the row for The Hobbit and the row for A Game of Thrones and find the Jaccard score. While this is valuable for looking up individual similarities, it is often more useful to have the similarities of all your items calculated at once in an easy-to-access DataFrame.
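A single lookup like this might be sketched as follows. The small genres_array_df built here is a hypothetical stand-in with invented 0/1 values, and it assumes the book titles are stored in the index, so its scores will not match the ones from the real dataset.

```python
import pandas as pd
from sklearn.metrics import jaccard_score

# Hypothetical stand-in for genres_array_df: one row per book,
# one 0/1 column per genre (invented values, titles assumed to be the index)
genres_array_df = pd.DataFrame(
    {
        "Adventure": [1, 1, 0],
        "Fantasy": [1, 1, 0],
        "Classics": [1, 0, 1],
        "War": [0, 1, 0],
    },
    index=["The Hobbit", "A Game of Thrones", "The Great Gatsby"],
)

# Select the two rows to compare
hobbit_row = genres_array_df.loc["The Hobbit"]
got_row = genres_array_df.loc["A Game of Thrones"]

# Two shared genres out of four combined genres in this toy data
pair_score = jaccard_score(hobbit_row, got_row)
print(pair_score)
```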

5. Finding the distance between all items

To get all of these similarities at once for our data, we will call upon two helpful functions from the scipy package. First, pdist (short for pairwise distance) helps us find all the distances at once, with 'jaccard' passed as the metric argument. This returns a condensed matrix, which contains all the pairwise distances in a 1D array. We then use squareform to reshape this 1D data into the square matrix we need, with one row and one column per item.
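The two calls could look like the sketch below. The genres array is an invented three-book example rather than the course data, and it is stored as booleans because the Jaccard metric treats each column as present or absent.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 0/1 genre rows for three books (invented values)
genres = np.array(
    [
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=bool,
)

# Condensed form: one distance per pair of rows, n * (n - 1) / 2 entries
jaccard_distances = pdist(genres, metric="jaccard")
print(jaccard_distances.shape)

# Square form: one row and one column per book, zeros on the diagonal
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances.shape)
```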

6. Finding the distance between all items

Note that pdist calculates the Jaccard distance, which is a measure of how different rows are from each other. As we want the complement of this, the similarity, we subtract the values from 1.
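As a quick sanity check on that subtraction, the sketch below compares one entry of 1 minus the distance matrix with jaccard_score for the same pair of rows. The genres array is invented for illustration; it is cast to booleans for pdist and left as integers for jaccard_score.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import jaccard_score

# Hypothetical 0/1 genre rows for three books (invented values)
genres = np.array(
    [
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)

# Distance measures how different rows are; 1 - distance is the similarity
jaccard_distances = squareform(pdist(genres.astype(bool), metric="jaccard"))
jaccard_similarity_array = 1 - jaccard_distances

# Cross-check the first two rows against jaccard_score
pair_score = jaccard_score(genres[0], genres[1])
print(np.isclose(jaccard_similarity_array[0, 1], pair_score))
```

After the subtraction the diagonal holds 1 rather than 0, since every item is fully similar to itself.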

7. Creating a usable distance table

We can now wrap this similarity array in a DataFrame for ease of use. We create a DataFrame called distance_df with the newly generated jaccard_similarity_array as the main argument and set both the index and columns arguments to the book titles from genres_array_df. Let's take a look at the distance_df DataFrame we just created, which, despite its name, holds similarities rather than distances.
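Put together, the wrapping step might look like this sketch. It assumes the book titles live in the index of genres_array_df (if they sit in a regular column instead, that column would be passed to index and columns), and the toy data is invented.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for genres_array_df (invented 0/1 values)
genres_array_df = pd.DataFrame(
    {
        "Adventure": [1, 1, 0],
        "Fantasy": [1, 1, 0],
        "Classics": [1, 0, 1],
        "War": [0, 1, 0],
    },
    index=["The Hobbit", "A Game of Thrones", "The Great Gatsby"],
)

# Pairwise Jaccard distances, reshaped to square form and flipped to similarities
genre_values = genres_array_df.to_numpy().astype(bool)
jaccard_similarity_array = 1 - squareform(pdist(genre_values, metric="jaccard"))

# Label both axes with the book titles so pairs can be looked up by name
distance_df = pd.DataFrame(
    jaccard_similarity_array,
    index=genres_array_df.index,
    columns=genres_array_df.index,
)
print(distance_df.round(2))
```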

8. Comparing books

This distance DataFrame can be used to look up any pairing of books to see how similar they are. Let's look up the similarity between The Hobbit and A Game of Thrones again by using the book titles to filter the distance_df DataFrame. This returns 0.75, a reasonable score, as they are both fun, action-packed fantasy books. If we perform a similar comparison between The Hobbit and The Great Gatsby, we get a much lower score of 0.15. Not a huge surprise, as The Great Gatsby has very little in common with The Hobbit.

9. Finding the most similar books

Finally, while comparing two books is useful, the similarity table is most valuable when you can use it to find a new book that is similar to the one you just read and enjoyed. For this, we select the column containing the book we want to compare with and then sort the results using .sort_values(). The ascending argument must be set to False to show the highest-ranked books first. Unsurprisingly, all the top recommendations are similar fantasy adventure books!
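Both the pairwise lookup and the ranking can be sketched on a small table. The similarity values below are invented for illustration and are not the scores from the real dataset; dropping the book's own row at the end is an extra step assumed here, since a book always matches itself with a score of 1.

```python
import pandas as pd

# Hypothetical similarity table (invented values, symmetric, 1.0 on the diagonal)
titles = ["The Hobbit", "A Game of Thrones", "The Great Gatsby", "The Two Towers"]
distance_df = pd.DataFrame(
    [
        [1.00, 0.50, 0.10, 0.80],
        [0.50, 1.00, 0.20, 0.60],
        [0.10, 0.20, 1.00, 0.10],
        [0.80, 0.60, 0.10, 1.00],
    ],
    index=titles,
    columns=titles,
)

# Look up the similarity for a single pair of titles
pair = distance_df.loc["The Hobbit", "A Game of Thrones"]
print(pair)

# Select the column for the book just read and sort highest first
ranked = distance_df["The Hobbit"].sort_values(ascending=False)

# The book itself ranks first, so remove it from the recommendations
recommendations = ranked.drop("The Hobbit")
print(recommendations)
```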

10. Let's practice!

This method of recommendation is valuable when you have good descriptive attributes for the items you want to compare. Let's generate recommendations using these techniques with the movie dataset from chapter one.