
Finding similarities

1. Finding similarities

We have been focusing on finding similar users so far in this chapter. This is called user-based collaborative filtering. Comparisons between items, or item-based collaborative filtering, is also possible.

2. Item-based recommendations

It assumes that if items A and B receive similar reviews, either positive or negative,

3. Item-based recommendations

Then however other people feel about A,

4. Item-based recommendations

They should feel the same way about B.

5. User-based to item-based

If we have our data prepped as we did in the last few exercises, we can switch between these two approaches

6. User-based to item-based

by transposing the matrix, giving us the items as rows and the users as columns.

7. User-based to item-based

This can be achieved in pandas with the book rating DataFrame (user_ratings_pivot) we generated previously, shown here: calling .T on a DataFrame gives us its transposed version. We can see the user-based matrix on top, and the corresponding item-based matrix on the bottom. We will discuss later in this chapter which matrix is preferred, but the high-level answer, like with many questions in data science, is that it depends on the data. We will focus on item-based filtering for now, as the items can be a little more relatable.
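As a minimal sketch of this transpose step, here is a tiny made-up version of user_ratings_pivot (the ratings and user labels are invented for illustration; the real DataFrame comes from the earlier exercises):

```python
import pandas as pd

# A small stand-in for user_ratings_pivot: users as rows, books as columns.
# The ratings here are made up purely for illustration.
user_ratings_pivot = pd.DataFrame(
    {"The Hobbit": [5.0, 4.0, 1.0], "The Lord of the Rings": [4.0, 5.0, 2.0]},
    index=["user_a", "user_b", "user_c"],
)

# .T transposes the DataFrame: books become rows, users become columns.
book_ratings_pivot = user_ratings_pivot.T
print(book_ratings_pivot)
```

The transpose is a view of the same data, so no ratings are changed; only the orientation of the matrix flips.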

8. Cosine similarities

With the item-based matrix containing a row per book, shown here, we can calculate the similarities and distances between items in the dataset, as we did with our content-based recommendations last chapter. We'll continue to use cosine similarity, but it is worth noting that as we have centered the data around zero, the cosine values can now range from -1 to 1, with 1 being the most similar and -1 the least. This does not have any impact on the process, so don't be concerned if you see some negative cosine values!

9. Cosine similarities

Even though the range of the output can be different, the way we calculate similarities is the same. Let's compare two books, The Lord of the Rings and The Hobbit, from our dataset with the cosine distance. As a reminder, cosine similarity compares two NumPy arrays, so we need to do some reshaping first.

10. Cosine similarities

We first get the rows we want to compare.

11. Cosine similarities

Then we need to turn them into NumPy arrays with .values.

12. Cosine similarities

And reshape them into the two-dimensional shape that cosine_similarity expects. As you can see, the two books are found to be quite similar (remember the values are between -1 and 1). This is expected as they are by the same author, but if we repeat it with two very different books, we might even get a negative value.
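The three steps above can be sketched as follows. The centered ratings here are invented for illustration; in the exercises the rows come from your item-based pivot table:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mean-centered ratings for two books across four users.
lotr = np.array([1.2, 0.9, -1.1, 0.5])
hobbit = np.array([1.0, 1.1, -0.9, 0.3])

# cosine_similarity expects 2D input, so reshape each row to (1, n_users).
similarity = cosine_similarity(lotr.reshape(1, -1), hobbit.reshape(1, -1))
print(similarity[0][0])  # close to 1: the books are rated very similarly
```

With real data you would first select the two rows from the pivot table and call .values on them, then reshape as above.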

13. Cosine similarities

Comparing items is all well and good, but of course you want to start making recommendations. Let's do so by finding the most similar items overall! To do this, we need to find the similarities between all the items at once. Just like we did with content-based recommendations, we can call cosine_similarity on the full dataset, resulting in a similarity matrix between all items. Tidying this up by wrapping it in a DataFrame, with the item names as the index and columns, gets us a usable lookup table with the similarities for all items.
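A minimal sketch of building that lookup table, assuming a small made-up item-based matrix of mean-centered ratings (three books, three users):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-based matrix: books as rows, centered user ratings as columns.
book_ratings_pivot = pd.DataFrame(
    [[1.0, 1.1, -0.9], [1.2, 0.9, -1.1], [-1.0, -0.8, 1.2]],
    index=["The Hobbit", "The Lord of the Rings", "Another Book"],
)

# cosine_similarity on the full matrix returns an items-by-items array.
similarities = cosine_similarity(book_ratings_pivot)

# Wrap it in a DataFrame so similarities can be looked up by book title.
cosine_similarity_df = pd.DataFrame(
    similarities,
    index=book_ratings_pivot.index,
    columns=book_ratings_pivot.index,
)
print(cosine_similarity_df)
```

The diagonal is all ones (every item is identical to itself), and the matrix is symmetric, since similarity between A and B equals similarity between B and A.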

14. Cosine similarities

With this matrix calculated, we can even make recommendations by finding the items rated most similarly to the one a user liked: select the item you want to compare against and sort its similarities. Here you can see that the item rated most similarly to The Hobbit was The Lord of the Rings, which makes sense, as they share characters and an author.
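That selection-and-sort step can be sketched like this, using a hypothetical precomputed similarity lookup table (the values are invented for illustration):

```python
import pandas as pd

# A hypothetical item-item similarity lookup table, as built in this lesson.
cosine_similarity_df = pd.DataFrame(
    [[1.00, 0.98, -0.95], [0.98, 1.00, -0.92], [-0.95, -0.92, 1.00]],
    index=["The Hobbit", "The Lord of the Rings", "Another Book"],
    columns=["The Hobbit", "The Lord of the Rings", "Another Book"],
)

# Select the column for the book a user liked, drop the book itself
# (it is trivially most similar to itself), and sort the rest.
ordered = (
    cosine_similarity_df["The Hobbit"]
    .drop("The Hobbit")
    .sort_values(ascending=False)
)
print(ordered.index[0])  # → "The Lord of the Rings"
```

The head of the sorted Series gives the top recommendations for readers who liked the selected book.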

15. Let's practice!

Let us try this with the movies dataset you have been working on!
