1. Dealing with sparsity
Now you are capable not only of generating recommendations for any user in your dataset, but also of predicting, using KNN, what rating users might give items they have not yet come across.
2. Sparse matrices
This works great for dense datasets in which every item has been reviewed by multiple people, which ensures that the K nearest neighbors are genuinely similar to your user. But what if the data is less full?
3. Sparse matrices
This is actually a common concern in real-world rating data, as the numbers of users and items are generally quite high while the number of reviews is quite low.
4. Sparse matrices
We call the percentage of a DataFrame that is empty the DataFrame's sparsity. In other words, the number of empty cells divided by the total number of cells.
5. Measuring sparsity
Let's bring back a larger version of the book rating DataFrame we used in the last chapter and find how sparse it is.
6. Measuring sparsity
We can check the sparsity of a DataFrame by counting the number of missing values in it using isnull dot values dot sum, finding the total number of cells in the DataFrame using dot size, and then dividing the empty count by the total.
Here we see that the DataFrame is only just over 1% filled, so it's quite sparse.
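The calculation just described can be sketched as follows. The tiny DataFrame here is illustrative, not the actual book rating data from the course:

```python
import numpy as np
import pandas as pd

# Illustrative user-by-book ratings DataFrame; NaN marks a missing rating
user_ratings_df = pd.DataFrame(
    [[4.0, np.nan, np.nan],
     [np.nan, 3.0, np.nan],
     [np.nan, np.nan, 5.0]],
    index=["User_1", "User_2", "User_3"],
    columns=["Book_A", "Book_B", "Book_C"],
)

# Count the empty (NaN) cells
number_of_empty = user_ratings_df.isnull().values.sum()

# Total number of cells in the DataFrame
total_cells = user_ratings_df.size

# Sparsity: the fraction of cells that are empty
sparsity = number_of_empty / total_cells
print(sparsity)
```

With 3 ratings out of 9 cells, this toy DataFrame is two-thirds empty; the real book rating data in the lesson is far sparser.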
7. Why sparsity matters
Why does this matter? Sparse data can create problems if we use KNN, because KNN requires finding the K nearest users that have rated the item. Take the DataFrame here.
8. Why sparsity matters
Let's say we wanted to estimate what rating User 1 would give item 5. We would find the K nearest neighbors' ratings of the item,
9. Why sparsity matters
but in this case, there are only 2 other users that have rated the item, so those are the only neighbors KNN has to work with.
10. Why sparsity matters
Therefore we would have to return an average of all available reviews (just 2 in this case) because there is no other data, which does not actually take user similarities into account.
11. Measuring sparsity per column
You can understand the scale of this issue by simply counting the number of actual reviews for each book using notnull dot sum.
We can see that a large number of books have only received one or two reviews.
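The per-book count just described can be sketched like this, again on a made-up DataFrame rather than the course data:

```python
import numpy as np
import pandas as pd

# Illustrative ratings DataFrame; most books have very few reviews
user_ratings_df = pd.DataFrame(
    {"Book_A": [4.0, 5.0, np.nan],
     "Book_B": [np.nan, 3.0, np.nan],
     "Book_C": [np.nan, np.nan, np.nan]},
    index=["User_1", "User_2", "User_3"],
)

# Count non-missing ratings per book (per column)
reviews_per_book = user_ratings_df.notnull().sum()

# Sort to surface the least-reviewed books first
print(reviews_per_book.sort_values())
```

Sorting the counts makes it easy to see which books have too few reviews for KNN to work well.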
12. Matrix factorization
So what alternatives do we have? Thankfully, we can leverage matrix factorization to deal with this problem remarkably well and create some quite interesting features while doing so.
13. Matrix factorization
Matrix factorization is when we decompose the user-rating matrix into the product of two lower dimensionality matrices.
The matrices shown here are factors of the original matrix on the left; if you were to find the product of the two of them, you would recover that original matrix.
By finding factors of the sparse matrix and then multiplying them together
14. Matrix factorization
we can be left with a fully filled matrix. We will dig into matrix factorization in the next few lessons but first we should review how matrix multiplication works.
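Here is a minimal NumPy sketch of the idea; the factor values are made up for illustration. The point is that the product of two small, dense factor matrices is itself a fully filled matrix:

```python
import numpy as np

# Two low-rank factor matrices: user features (3 x 2) and item features (2 x 4)
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.0],
                         [0.8, 0.3]])
item_factors = np.array([[4.0, 2.0, 1.0, 3.0],
                         [1.0, 3.0, 4.0, 2.0]])

# Their product is a fully filled 3 x 4 matrix: every user-item cell
# gets a predicted value, even pairs that were never rated
predicted_ratings = user_factors @ item_factors
print(predicted_ratings)
```

Every cell of the product is populated, which is exactly what makes factorization useful for filling in a sparse ratings matrix.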
15. Matrix multiplication
To multiply two rectangular matrices,
16. Matrix multiplication
The number of rows in the first matrix M here
17. Matrix multiplication
and the number of columns in the second matrix N here do not have to match
18. Matrix multiplication
But the number of columns of the first matrix must match the number of rows in the second.
19. Matrix multiplication
This results in an m by n matrix.
20. Matrix multiplication
This same multiplication can be performed in Python using NumPy's dot function. Here we can see that the dot product of matrix_A (a three by two matrix) and matrix_B (a two by three matrix)
21. Matrix multiplication
is a three by three matrix.
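The example just described can be reproduced as follows; the matrix values themselves are illustrative:

```python
import numpy as np

# A three-by-two matrix
matrix_A = np.array([[1, 2],
                     [3, 4],
                     [5, 6]])

# A two-by-three matrix
matrix_B = np.array([[7, 8, 9],
                     [10, 11, 12]])

# Columns of matrix_A (2) match rows of matrix_B (2), so the product is defined
product = np.dot(matrix_A, matrix_B)
print(product.shape)  # (3, 3)
```

Swapping the order, np.dot(matrix_B, matrix_A), would instead give a two-by-two result, since matrix multiplication is not commutative.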
22. Let's practice!
We can dig into why this is so useful soon, but let's practice what we have learned in this lesson first!