
Dealing with sparsity

1. Dealing with sparsity

Now you are capable of not only generating recommendations for any user in your dataset, but also predicting what rating users might give items they have not come across using KNN.

2. Sparse matrices

This works great for dense datasets in which every item has been reviewed by multiple people, which ensures that the K nearest neighbors are genuinely similar to your user. But what if the data is less complete?

3. Sparse matrices

This is actually a common concern in real-world rating data, as the number of users and items is generally quite high while the number of reviews is quite low.

4. Sparse matrices

We call the percentage of a DataFrame that is empty the DataFrame's sparsity. In other words, the number of empty cells divided by the total number of cells.

5. Measuring sparsity

Let's bring back a larger version of the book rating DataFrame we used in the last chapter and find how sparse it is.

6. Measuring sparsity

We can check the sparsity of a DataFrame by counting the number of missing values in it using .isnull().values.sum(), finding the total number of cells in the DataFrame using .size, and then dividing the empty count by the total. Here we see that the DataFrame is only just over 1% filled, so it's quite sparse.
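The calculation above can be sketched as follows, using a small hypothetical ratings DataFrame (the names `ratings`, `User_1`, and `Book_A` are illustrative, not from the course data):

```python
import numpy as np
import pandas as pd

# A small hypothetical user-item rating DataFrame with missing values
ratings = pd.DataFrame(
    [[4.0, np.nan, np.nan],
     [np.nan, 3.0, np.nan],
     [np.nan, np.nan, 5.0]],
    index=["User_1", "User_2", "User_3"],
    columns=["Book_A", "Book_B", "Book_C"],
)

# Number of empty cells divided by the total number of cells
sparsity = ratings.isnull().values.sum() / ratings.size
print(sparsity)  # 6 of the 9 cells are empty
```

The same approach scales to the full book rating DataFrame; only the numbers change.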

7. Why sparsity matters

Why does this matter? Sparsity creates problems for KNN because KNN requires you to find the K nearest users that have actually rated the item. Take the DataFrame here.

8. Why sparsity matters

Let's say we wanted to estimate what rating User 1 would give item 5. We would find the K nearest users that have rated the item,

9. Why sparsity matters

but in this case, only 2 other users have rated the item.

10. Why sparsity matters

Therefore we would have to return an average of all available ratings (2 in this case) because there is no other data. This does not actually take user similarity into account.

11. Measuring sparsity per column

You can understand the scale of this issue by simply counting the number of actual reviews for each book using .notnull().sum(). We can see that a large number of books have only received one or two reviews.
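A minimal sketch of this per-column count, again on a small made-up DataFrame (the column and index names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: Book_A has two reviews, Book_B only one
ratings = pd.DataFrame(
    {"Book_A": [4.0, np.nan, 5.0],
     "Book_B": [np.nan, 3.0, np.nan]},
    index=["User_1", "User_2", "User_3"],
)

# Count the non-missing ratings in each column
counts = ratings.notnull().sum()
print(counts)
```

Sorting the result with `counts.sort_values()` makes the rarely reviewed books easy to spot.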

12. Matrix factorization

So what alternatives do we have? Thankfully, we can leverage matrix factorization to deal with this problem remarkably well and create some quite interesting features while doing so.

13. Matrix factorization

Matrix factorization is when we decompose the user-rating matrix into the product of two lower-dimensionality matrices. The matrices shown here are factors of the original matrix on the left; if you were to find their product, it would equal that original matrix. By finding factors of the sparse matrix and then multiplying them together

14. Matrix factorization

we can be left with a fully filled matrix. We will dig into matrix factorization in the next few lessons, but first we should review how matrix multiplication works.
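The key idea, that multiplying two small factor matrices yields one fully filled ratings matrix, can be sketched like this. The factor values here are invented for illustration; how to actually find good factors is covered in the upcoming lessons:

```python
import numpy as np

# Hypothetical factors: 4 users x 2 latent features,
# and 2 latent features x 3 items
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.0],
                         [0.9, 0.1],
                         [0.4, 0.6]])
item_factors = np.array([[3.0, 1.0, 2.0],
                         [1.0, 4.0, 0.5]])

# Their product is a fully filled 4 x 3 matrix of predicted ratings
predicted = user_factors @ item_factors
print(predicted.shape)  # (4, 3)
```

Every cell of `predicted` has a value, even for user-item pairs that had no rating in the original sparse matrix.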

15. Matrix multiplication

To multiply two rectangular matrices,

16. Matrix multiplication

The number of rows in the first matrix M here

17. Matrix multiplication

and the number of columns in the second matrix N here do not have to match

18. Matrix multiplication

But the number of columns of the first matrix must match the number of rows in the second.

19. Matrix multiplication

This results in an M by N matrix.

20. Matrix multiplication

This same multiplication can be performed in Python using NumPy's dot product function. Here we can see the dot product of matrix_A (a three by two matrix) and matrix_B (a two by three matrix),

21. Matrix multiplication

is a three by three matrix.
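A sketch of that multiplication with np.dot; the values in `matrix_A` and `matrix_B` are arbitrary placeholders, only their shapes matter:

```python
import numpy as np

# A three by two matrix
matrix_A = np.array([[1, 2],
                     [3, 4],
                     [5, 6]])

# A two by three matrix: its row count matches matrix_A's column count
matrix_B = np.array([[7, 8, 9],
                     [10, 11, 12]])

# The dot product is a three by three matrix
product = np.dot(matrix_A, matrix_B)
print(product.shape)  # (3, 3)
```

The `@` operator (`matrix_A @ matrix_B`) performs the same operation.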

22. Let's practice!

We can dig into why this is so useful soon, but let's practice what we have learned in this lesson first!