
Collaborative filtering

1. Collaborative filtering

In the last chapter, we used the items a customer liked to suggest other, similar items. This works well when we have a lot of information about the items but not much data on how people feel about them. In this chapter, we will find the users whose preferences are most similar to those of the user we are making recommendations for, and based on that group's preferences, make suggestions.

2. Collaborative filtering

This form of recommendation is called collaborative filtering. Collaborative filtering is the name given to the prediction, or filtering, of items that might interest a user based on the preferences of similar users. It works on the premise that if person A has similar tastes to persons B and C,

3. Collaborative filtering

and both person B and person C also like a certain item,

4. Collaborative filtering

then it is likely that person A will also like that item.

5. Finding similar users

But how do we go about programmatically finding users with similar interests? Rating data is often difficult to compare between users. Even here it is not immediately clear how User_1 and User_2 compare.

6. Finding similar users

We need to get this data into a matrix of users and the items they rated. Now we can see which items both users have rated. Based on this matrix we can compare across users; here it is apparent that User_1 and User_3 have more similar preferences than User_1 and User_2.

7. Working with real data

Time for some real data! We will continue working with the book ratings dataset from the previous chapters containing each user, the book they rated, and the rating score.
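To make the next steps concrete, here is a minimal sketch of what such long-format data could look like. The column names user, book, and rating, and the ratings themselves, are illustrative assumptions rather than the course dataset itself.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, book, rating).
book_ratings = pd.DataFrame({
    "user": ["User_1", "User_1", "User_2", "User_2", "User_3", "User_3"],
    "book": ["The Great Gatsby", "Catcher in the Rye",
             "Catcher in the Rye", "Fifty Shades of Grey",
             "The Great Gatsby", "Catcher in the Rye"],
    "rating": [4, 5, 5, 4, 4, 5],
})
print(book_ratings)
```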

8. Pivoting our data

As the data is in a DataFrame, pandas' pivot method can be used to reshape it around specified columns. We want the users as the index, the books as the columns, and the ratings as the corresponding values, like you see here.
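Continuing with the book_ratings sketch from above, the pivot might look like this:

```python
# Reshape to a user-by-book matrix: users as the index, books as the
# columns, ratings as the values. Books a user has not rated become NaN.
user_ratings = book_ratings.pivot(index="user",
                                  columns="book",
                                  values="rating")
print(user_ratings)
```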

9. Data sparsity

The first thing that may become apparent after this transformation is the number of missing entries, shown by the NaN values. This is expected: a user will rarely have rated every item, and it is similarly rare for an item to have been rated by every user. This is an issue, as most similarity metrics do not handle missing data well. How can we deal with this? We cannot simply drop all the rows and columns that contain missing data; with data this sparse, that could remove the whole DataFrame!
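One quick way to quantify how sparse the pivoted matrix is, continuing with user_ratings from the sketch above:

```python
# Fraction of user-book pairs that have no rating at all.
frac_missing = user_ratings.isna().to_numpy().mean()
print(f"{frac_missing:.0%} of the matrix is missing")
```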

10. Filling the missing values

Alternatively, you might suggest filling the empty values with 0s. That might be valid for some machine learning models, but it can create issues with recommendation engines. Take, for example, the second user here. They loved Catcher in the Rye and enjoyed Fifty Shades of Grey, but have not rated The Great Gatsby. If we were to fill this NaN with a 0, we would incorrectly imply that they greatly disliked the book compared to the others, which we cannot know for sure.

11. Filling the missing values

One alternative is to center each user's ratings around 0 by subtracting the row average, and then fill in the missing values with 0. This way, the missing data is replaced with neutral scores.

12. Filling the missing values

We first find the row means, then subtract each row's mean from the values in that row; you can see the rows centered around 0 here.
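A sketch of these two steps in pandas, continuing with the user_ratings matrix from above:

```python
# Each user's average rating; NaNs are skipped by default.
avg_ratings = user_ratings.mean(axis=1)

# Subtract each user's average from their row, centering the rows on 0.
user_ratings_centered = user_ratings.sub(avg_ratings, axis=0)
print(user_ratings_centered)
```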

13. Filling the missing values

We then fill the NaNs with 0s. This is not a perfect solution: the values lose some of their interpretability and should not be used as predictions in themselves, but they suffice for comparing between users.
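And the final fill step, applied to the centered matrix:

```python
# A missing rating becomes 0, i.e. "no deviation from this user's average".
user_ratings_normed = user_ratings_centered.fillna(0)
print(user_ratings_normed)
```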

14. Let's practice!

We can now calculate similarities between users, and we will get to that soon, but first let's work through shaping the data!