
Using K-nearest neighbors

1. Using K-nearest neighbors

You are now able to find similar items based on how the users in your dataset have rated them.

2. Beyond similar items

But what if we wanted to not only find similarly rated items, but actually predict how a user might rate an item, even if it is not similar to any item they have seen? One approach is to find similar users using a K-nearest neighbors model and see how they liked the item.

3. K-nearest neighbors

As a reminder, K-NN finds the k users that are closest, as measured by a specified metric, to the user in question. It then averages the ratings those users gave the item we are trying to get a rating for. In this example, k equals 3, so it finds the 3 nearest users and gets their ratings. This allows us to predict how we think a user might feel about an item, even if they haven't seen it before. Scikit-learn has a pre-built KNN model we will use later, but it is valuable to understand how it works by going through the process step by step first.
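To make the averaging concrete, here is a minimal sketch in Python; the neighbor ratings are made-up numbers purely for illustration:

    # Ratings the 3 nearest neighbors gave the target item (hypothetical values).
    neighbor_ratings = [4.0, 5.0, 3.0]

    # The K-NN prediction is simply their average.
    predicted_rating = sum(neighbor_ratings) / len(neighbor_ratings)
    print(predicted_rating)  # 4.0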

4. User-user similarity

We continue with our book rating DataFrame, this time predicting what rating User_1 might give the book "Catch-22", which they have not read. We previously generated the similarity scores between all items in the item-based DataFrame. As we are now looking to find similar users, we repeat the process on the user-based DataFrame, assigning the users as both columns and indices.
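As a rough sketch of how such a user-user similarity matrix could be built (the sample ratings and variable names here are invented for illustration, not the course's exact data):

    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical user-based ratings table: a row per user, a column per book,
    # NaN where a user has not rated a book. User_1 has not read "Catch-22".
    user_ratings_table = pd.DataFrame(
        {"Catch-22": [None, 2.0, 5.0, 4.0, 3.0],
         "1984": [4.0, 5.0, 4.0, 3.0, 2.0],
         "Dune": [5.0, 1.0, 5.0, 4.0, 2.0]},
        index=["User_1", "User_2", "User_3", "User_4", "User_5"],
    )

    # Center each user's ratings around 0 and fill the gaps with 0
    # so that cosine similarity can be computed.
    user_ratings_centered = user_ratings_table.sub(
        user_ratings_table.mean(axis=1), axis=0
    ).fillna(0)

    # Similarity between every pair of users, with users as columns and indices.
    user_similarities = pd.DataFrame(
        cosine_similarity(user_ratings_centered),
        index=user_ratings_centered.index,
        columns=user_ratings_centered.index,
    )
    print(user_similarities)

With this sample data, the User_1/User_3 score comes out strongly positive while the User_1/User_2 score comes out strongly negative, mirroring the grid on the slides.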

5. Understanding the similarity matrix

Examining the output, we see a grid with all users as both rows and columns and, where they meet, their similarity score.

6. Understanding the similarity matrix

So User_1 and User_3 here are quite similar.

7. Understanding the similarity matrix

While User_1 and User_2 are not.

8. Step by step KNN

Let's set k to 3 and find the k-nearest neighbors of User_1. We select User_1's similarity values, then order them to find the 3 most similar users, getting just their names using dot-index.
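Continuing the sketch from before (user_similarities is the hypothetical matrix built above), the step-by-step selection might look like this:

    # User_1's similarity to every user, dropping the (perfect) match with itself.
    user_similarity_series = user_similarities.loc["User_1"].drop("User_1")

    # Order from most to least similar, keep the top k = 3,
    # and get just the names with .index.
    nearest_neighbors = user_similarity_series.sort_values(ascending=False)[:3].index
    print(nearest_neighbors)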

9. Step by step KNN

We then find the ratings these users gave the book in our original ratings DataFrame and take their mean. This represents the rating User_1 would likely give "Catch-22", based on the ratings that similar users gave it.
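A sketch of that lookup, reusing the hypothetical tables from above:

    # Ratings the nearest neighbors gave "Catch-22" in the original,
    # uncentered ratings table.
    neighbor_ratings = user_ratings_table.loc[nearest_neighbors, "Catch-22"]

    # The average is our predicted rating for User_1 (4.0 with the sample data).
    print(neighbor_ratings.mean())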

10. Using scikit-learn's KNN

Let's look at how this can be done using scikit-learn. For this, we need two datasets: the centered user-based rating DataFrame, with a row per user, a column per item, and the ratings centered around 0, and the original user_ratings_table with the uncentered scores and missing values.

11. Using scikit-learn's KNN

We drop the "Catch-22" column, as that will be our target, and separate out the user we are predicting for. Note we use double brackets to keep this as a DataFrame. The original raw ratings for the item we are predicting on are extracted; think of these as the y values in your model.
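A sketch of that split, carrying on with the hypothetical tables from earlier (target_users_x and other_users_y follow the names used in the narration; users_to_ratings is an assumed intermediate name):

    # Features: every column except the target book, taken from the centered table.
    users_to_ratings = user_ratings_centered.drop("Catch-22", axis=1)

    # Double brackets keep the single target user as a DataFrame, not a Series.
    target_users_x = users_to_ratings.loc[["User_1"]]

    # The original raw ratings for the target book - the y values.
    other_users_y = user_ratings_table["Catch-22"]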

12. Using scikit-learn's KNN

As we only care about neighbors that have read the book, we filter for the users that have actually rated it. We similarly drop the rows in the ratings that are empty. Think of other_users_x and other_users_y as your X and y training values, while target_users_x is the data you are predicting on.
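Continuing the sketch:

    # Keep only the users who have actually rated "Catch-22" as training rows.
    # Note that User_1's empty rating drops out here too, so they are not
    # in the training set.
    other_users_x = users_to_ratings[other_users_y.notnull()]

    # Drop the matching empty entries from the y values.
    other_users_y = other_users_y.dropna()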

13. Using scikit-learn's KNN

We can then import and instantiate the KNeighborsRegressor model from sklearn, specifying cosine similarity as the metric. We fit it the same way we fit any model, and predict on the user values we want predictions for.
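Put together, the fit-and-predict step might look like this (n_neighbors and the variable names are assumptions consistent with the sketch above):

    from sklearn.neighbors import KNeighborsRegressor

    # k-NN regression over users, with cosine distance as the metric.
    user_knn = KNeighborsRegressor(metric="cosine", n_neighbors=3)

    # Fit on the users who rated the book, then predict for the target user.
    user_knn.fit(other_users_x, other_users_y)
    user_user_pred = user_knn.predict(target_users_x)
    print(user_user_pred)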

14. Using scikit-learn's KNN

An advantage of the sklearn approach is that you can quickly change parameters, or even try out classification as opposed to regression, where the most common rating is predicted as opposed to the average, as seen here!
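For example, swapping in the classifier variant is a small change (a sketch; the ratings are cast to integers here so they can act as class labels):

    from sklearn.neighbors import KNeighborsClassifier

    # Same workflow, but the most common neighbor rating wins instead of the mean.
    user_knc = KNeighborsClassifier(metric="cosine", n_neighbors=3)
    user_knc.fit(other_users_x, other_users_y.astype(int))
    print(user_knc.predict(target_users_x))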

15. Let's practice!

Now it's time to try this yourself.