Intro to content-based recommendations

1. Intro to content-based recommendations

So far we have looked at making recommendations based solely on how the entire population feels about items. While these recommendations can be useful, they aren't personalized.

2. What are content-based recommendations?

In this chapter, we will move to more targeted models by recommending items based on their similarity to items a user has liked in the past. For example, if a user likes book A, and we calculate that book A and book B are similar, we expect the user to like book B. We will address how to determine which items are similar and which are not. We can do so by comparing the attributes of our items. Recommendations made by finding items with similar attributes are called content-based recommendations.

3. Items' attributes or characteristics

For example, if we were looking at a dataset describing books, the attributes could be the author of the book, its publishing date, its length, or its genre: really, any descriptive information. A big advantage of using an item's attributes over user feedback is that you can make recommendations for any item you have attribute data on. This includes even brand-new items that users have not seen yet. Content-based models require us to use the available attributes to build profiles of items in a way that allows us to compare them mathematically. This allows us, for example, to find the most similar items and recommend them.

4. Vectorizing your attributes

This is best done by encoding each item as a vector. Here we can see an example with each item's vector stored as a row and each feature as a column. Why this shape, you might ask? Having your data in this format makes it easy to calculate distances and similarities between items, which is vital for generating recommendations. We'll discuss how to calculate distances and similarities between vectors later in the course. First, we will cover how to convert the most common data format for attributes to this shape. We will continue using the book dataset from chapter 1, but this time we introduce an additional book_genre table.
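As a rough illustration of this shape, the sketch below builds a small table with one row per item and one column per feature, then pulls out a single item's vector. The book names, genre columns, and the item_vectors variable are all made up for this example and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical data: one row per item, one column per feature.
# A 1 means the book has that attribute, a 0 means it does not.
item_vectors = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [0, 0, 1]],
    index=["Book A", "Book B", "Book C"],
    columns=["Adventure", "Fantasy", "Romance"],
)

# Each row is the vector that describes one item.
book_a_vector = item_vectors.loc["Book A"].to_numpy()

print(item_vectors)
print(book_a_vector)
```

With every item laid out as a row like this, any two items can be compared position by position, which is what the distance and similarity calculations later in the course rely on.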

5. One-to-many relationships

This book_genre table, as seen here on the left, contains a one-to-many mapping of books to their genres, so a book with several genres appears on several rows. This type of one-to-many lookup is very common in relational databases. Remember, from this table we want to create a new table that contains a single row per item, with one column per attribute encoding whether or not the item has it, like you see here on the right.

6. Crosstabulation

To transform this data we can use pandas' crosstab function. The crosstab function generates the cross-tabulation of two (or more) factors, and here we want to use it to find the cross-tabulation of the book titles and the genres they have been labeled with.

7. Crosstabulation

We call pd.crosstab, passing in the book titles as the first argument and the book genres as the second argument. The first argument becomes the rows, and the second becomes the columns. Here we can see the desired result.
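Here is a minimal sketch of that call. The book_genre_df name, its Book and Genre column names, and the sample rows are assumptions made for this example; the real book_genre table from the course may be laid out differently.

```python
import pandas as pd

# Hypothetical one-to-many table: one row per (book, genre) pair.
book_genre_df = pd.DataFrame({
    "Book": ["The Hobbit", "The Hobbit", "Dune", "Dune", "Emma"],
    "Genre": ["Fantasy", "Adventure", "Science Fiction", "Adventure", "Romance"],
})

# The first argument becomes the rows, the second becomes the columns.
book_genre_matrix = pd.crosstab(book_genre_df["Book"], book_genre_df["Genre"])

print(book_genre_matrix)
```

Note that crosstab counts occurrences, so each cell is 1 or 0 only when every book and genre pair appears at most once in the source table. If the table might contain repeated pairs, dropping duplicates first is one way to keep the result binary.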

8. Let's practice!

Great, now we have our data in a format that will allow us to calculate similarities and make recommendations. Time to try these data transformations yourself.
