1. Non-personalized recommendations
The first type of recommendations we will generate are called non-personalized recommendations. They are so called because they are the same for all users, without taking individual preferences into account.
2. Non-personalized ratings
One example is recommending the items most frequently seen together, as you can see here on Amazon. This might not surface the 'best' items or the items most suited to you, but since they are so common, there is a good chance you will not dislike them.
3. Finding the most popular items
We can demonstrate how to find these with Python using the book rating DataFrame shown here. Each row corresponds to an instance of a reader completing a book, with the book title stored in the book column.
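The course's DataFrame is not reproduced in this transcript, but a minimal stand-in might look like this (the reader IDs and book titles are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the course's reading log: one row per
# instance of a reader completing a book, with the title in "book".
book_df = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u4"],
    "book": ["Dune", "Dune", "Dune", "Emma", "Emma", "Hamlet"],
})
print(book_df)
```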
4. Finding the most popular items
By specifying the column of interest ("book" in this case) and using pandas' value_counts method, we obtain the number of occurrences of each book, sorted from highest to lowest.
5. Finding the most popular items
We get just the names of the books by accessing the index of the result.
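These two steps can be sketched as follows, using invented toy data in place of the course dataset:

```python
import pandas as pd

# Toy reading log (hypothetical data, not the course dataset)
book_df = pd.DataFrame({
    "book": ["Dune", "Emma", "Dune", "Hamlet", "Dune", "Emma"],
})

# Count how often each book appears, sorted from most to least frequent
book_counts = book_df["book"].value_counts()
print(book_counts)

# The book titles themselves live in the index of the result
popular_books = book_counts.index
print(popular_books)
```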
6. Finding the most liked items
While this is a good start, we haven't incorporated any data about what readers thought about each book.
Let's include that data here with an additional 'rating' column showing the reader's rating out of 5 for each book they read. We can use this to create alternative recommendations by finding the most highly rated books.
7. Finding the most liked items
This is done by averaging the rating of each of the books and examining the highest-ranked ones.
We select only the columns of interest (the title and the rating) and then specify which column we will be grouping by (the book title). We then find the mean of the groupby object using dot-mean.
This returns a DataFrame with a row per book and the average rating it receives.
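A minimal sketch of this grouping step, again using invented data with a 'rating' column out of 5:

```python
import pandas as pd

# Hypothetical ratings data: one row per (book, rating out of 5) pair
book_df = pd.DataFrame({
    "book": ["Dune", "Dune", "Emma", "Emma", "Hamlet"],
    "rating": [5, 4, 3, 5, 5],
})

# Keep only the columns of interest, group by title, average the ratings
avg_ratings = book_df[["book", "rating"]].groupby("book").mean()
print(avg_ratings)
```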
Unfortunately, unlike value_counts(), the groupby() method does not automatically sort the output by value.
8. Finding the most liked items
Therefore we will use the sort_values method, specifying that we want to sort by the ratings, in descending order (it is ascending by default).
Examining the sorted DataFrame using dot-head, we now see that the top values do indeed have very high ratings, but the books may look very unfamiliar.
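The sorting step might look like this; note how a made-up, once-reviewed title floats to the top of the toy data, illustrating the skew described next:

```python
import pandas as pd

# Invented toy data; "Obscure Book" has a single perfect rating
book_df = pd.DataFrame({
    "book": ["Dune", "Dune", "Emma", "Obscure Book", "Hamlet"],
    "rating": [5, 4, 3, 5, 4],
})

avg_ratings = book_df[["book", "rating"]].groupby("book").mean()

# groupby does not sort by value, so sort explicitly, highest first
sorted_ratings = avg_ratings.sort_values(by="rating", ascending=False)
print(sorted_ratings.head())
```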
This is because items with very low numbers of ratings can skew the results. A book with only one rating has a solid chance of that rating being 5 stars, pushing it to the top, while a book that has been reviewed hundreds of times is likely to have at least one non-perfect review.
9. Finding the most liked items
We can test our theory by seeing how many times the highest-ranked books have been reviewed.
As predicted, they each occur only once in the dataset.
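We can sketch that check on the same invented data by counting how often the top-rated titles appear:

```python
import pandas as pd

# Invented toy data (not the course dataset)
book_df = pd.DataFrame({
    "book": ["Dune", "Dune", "Emma", "Obscure Book", "Hamlet", "Dune"],
    "rating": [5, 4, 3, 5, 4, 5],
})

# Average ratings, highest first
sorted_ratings = (
    book_df[["book", "rating"]]
    .groupby("book")
    .mean()
    .sort_values(by="rating", ascending=False)
)

# How often has each of the top-rated books actually been reviewed?
top_books = sorted_ratings.head(2).index
print(book_df["book"].value_counts()[top_books])
```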
10. Finding the most liked popular items
But by combining the initial work of counting occurrences with average ratings, we can get very useful recommendations.
We can use the value_counts method once again to find the counts of occurrences and store them as book_frequency.
We then cut down this Series by creating a mask of only the books that have been reviewed more than one hundred times in our dataset, and store the result as frequently_reviewed_books.
Note that we access the index here, as we want the names of the books rather than the counts of their occurrences.
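A sketch of this filtering step; the toy data is invented, and because it is tiny, a threshold of one review stands in for the course's one-hundred-review cutoff:

```python
import pandas as pd

# Invented toy data (not the course dataset)
book_df = pd.DataFrame({
    "book": ["Dune", "Dune", "Dune", "Emma", "Emma", "Obscure Book"],
    "rating": [5, 4, 4, 3, 5, 5],
})

# Count the occurrences of each book
book_frequency = book_df["book"].value_counts()

# Mask down to often-reviewed books (the course uses > 100 reviews;
# this toy data uses > 1), keeping the titles via the index
frequently_reviewed_books = book_frequency[book_frequency > 1].index
print(frequently_reviewed_books)
```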
11. Finding the most liked popular items
We then take a subset of our overall ratings DataFrame by selecting only the rows referring to books in frequently_reviewed_books using the isin method.
This subset of reviews can now be used in the same way as earlier to find the highest rated books on average.
Inspecting the result we now see that the top books no longer have full marks but are more recognizable titles.
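Putting the pieces together on the same invented toy data (with a one-review cutoff standing in for the course's one hundred):

```python
import pandas as pd

# Invented toy data (not the course dataset)
book_df = pd.DataFrame({
    "book": ["Dune", "Dune", "Dune", "Emma", "Emma", "Obscure Book"],
    "rating": [5, 4, 4, 3, 5, 5],
})

# Books with more than one review (stand-in for the 100-review cutoff)
book_frequency = book_df["book"].value_counts()
frequently_reviewed_books = book_frequency[book_frequency > 1].index

# Keep only rows for those books, then rank them by average rating
popular_reviews = book_df[book_df["book"].isin(frequently_reviewed_books)]
best_popular = (
    popular_reviews[["book", "rating"]]
    .groupby("book")
    .mean()
    .sort_values(by="rating", ascending=False)
)
print(best_popular)
```

The once-reviewed perfect-score title drops out, and the remaining books are ranked by averages built on more evidence.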
12. Let's practice!
Now let's put what we've learned to the test!