1. Building a plot line based recommender
In this lesson,
we will use tf-idf vectors and cosine scores to build a recommender system that suggests movies based on overviews.
2. Movie recommender
We have a dataset containing movie overviews.
Here, we can see two movies, Shanghai Triad and Cry, the Beloved Country, along with their overviews.
3. Movie recommender
Our task is to build a system that takes in a movie title and outputs a list of movies that have similar plot lines. For instance, if we passed in
'The Godfather', we could expect output like this.
Notice how a lot of the movies listed here have to do with crime and gangsters, just like The Godfather.
4. Steps
The steps involved are as follows. The first step, as always, is to preprocess
movie overviews. The next step is to generate
the tf-idf vectors for our overviews. Finally, we generate a
cosine similarity matrix which contains the pairwise similarity scores of every movie with every other movie. Once the cosine similarity matrix is computed, we can proceed to build the recommender function.
5. The recommender function
We will build a recommender function as part of this course. Let's take a look at how it works. The recommender function takes a movie title,
the cosine similarity matrix and an indices series as arguments. The indices series is a reverse mapping from movie titles to their indices in the original dataframe. The function extracts
the pairwise cosine similarity scores of the movie passed in with every other movie. Next, it
sorts these scores in descending order. Finally, it
outputs the titles of movies corresponding to the highest similarity scores. Note that the function ignores
the highest similarity score of 1. This is because the movie most similar to a given movie is the movie itself!
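The steps just described can be sketched roughly as follows. This is only a minimal illustration, not the function built in the exercises: the top_n parameter, the hand-written similarity matrix and the "Movie A" style titles are all assumptions made for the example. The sketch also drops the queried movie by its position rather than by skipping the first sorted score, which is an assumed variation that stays correct when two movies tie at a score of 1.

```python
import numpy as np
import pandas as pd

def get_recommendations(title, cosine_sim, indices, top_n=10):
    """Hypothetical sketch: return up to top_n titles most similar to `title`."""
    # Row position of the requested movie in the similarity matrix
    idx = indices[title]
    # Pair every movie position with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the pairs by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)
    # Ignore the requested movie itself, then keep the top_n best matches
    positions = [pos for pos, _ in sim_scores if pos != idx][:top_n]
    # Map positions back to titles by inverting the indices series
    titles = pd.Series(indices.index, index=indices.values)
    return titles.loc[positions].tolist()

# Made-up similarity matrix for four made-up movies (assumption for illustration)
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.0],
    [0.3, 0.5, 0.0, 1.0],
])
indices = pd.Series(range(4), index=["Movie A", "Movie B", "Movie C", "Movie D"])

recs = get_recommendations("Movie A", cosine_sim, indices, top_n=2)
print(recs)  # ['Movie B', 'Movie D']
```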
6. Generating tf-idf vectors
Let's say we already have the preprocessed movie overviews as 'movie_plots'. We already know
how to generate the tf-idf vectors.
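As a reminder, this step might look something like the sketch below. The three short overviews are invented stand-ins for the real preprocessed 'movie_plots', and the vectorizer is left at its default settings, which is an assumption rather than the course's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the preprocessed overviews
movie_plots = [
    "mafia boss hands control of his crime empire to his son",
    "gangster rises through the ranks of a crime family",
    "young wizard attends a school of magic",
]

# Fit the vectorizer on the overviews and transform them into tf-idf vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# One row per movie, one column per vocabulary term
print(tfidf_matrix.shape)
```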
7. Generating cosine similarity matrix
Generating the cosine similarity matrix is also extremely simple. We simply pass
in tfidf_matrix as both the first and second argument of cosine_similarity. This generates a matrix
that contains the pairwise similarity score of every movie with every other movie. The value corresponding to the ith row and the jth column is the cosine similarity score of movie i with movie j. Notice that the diagonal elements of this matrix are all 1. This is because, as stated earlier, the cosine similarity score of movie k with itself is 1.
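A small sketch of this step, again using invented overviews in place of the real data, could look like this. It also checks the two properties just mentioned: the matrix is square with one row per movie, and its diagonal is 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the preprocessed overviews
movie_plots = [
    "mafia boss hands control of his crime empire to his son",
    "gangster rises through the ranks of a crime family",
    "young wizard attends a school of magic",
]
tfidf_matrix = TfidfVectorizer().fit_transform(movie_plots)

# Pass the same matrix twice to score every movie against every other movie
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(cosine_sim.shape)  # (3, 3)
# Each movie compared with itself scores 1 (up to floating-point error)
print(np.allclose(np.diag(cosine_sim), 1.0))
```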
8. The linear_kernel function
The magnitude of a tf-idf vector that has been L2-normalized, which is the default in scikit-learn's TfidfVectorizer,
is always 1. Recall from the previous lesson that the cosine score is computed as the ratio of the dot product to the product of the magnitudes of the vectors. Since each magnitude is 1, the cosine score of two tf-idf vectors
is equal to their dot product! This fact can help us greatly improve the speed of computation
of our cosine similarity matrix as we do not need to compute the magnitudes while working with tf-idf vectors. Therefore, while working with tf-idf vectors, we can use
the linear_kernel function which computes the pairwise dot product of every vector with every other vector.
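Written out, the reasoning goes as follows. The symbols a and b are introduced here only for illustration and stand for the tf-idf vectors of any two overviews, each assumed to have unit magnitude.

```latex
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\mathbf{a} \cdot \mathbf{b}}{1 \cdot 1}
  = \mathbf{a} \cdot \mathbf{b}
  \qquad \text{when } \lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1 .
```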
9. Generating cosine similarity matrix
Let us replace the cosine_similarity function
with linear_kernel. As you
can see, the output remains the same but it takes significantly less time to compute.
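One way to check this for yourself is sketched below, assuming a small invented corpus repeated a few hundred times so there is something to time. The printed timings will vary from machine to machine and are not meant as a benchmark; the sketch only verifies that the row magnitudes are 1 and that both functions return the same matrix.

```python
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, repeated to make the matrices a little larger
movie_plots = [
    "mafia boss hands control of his crime empire to his son",
    "gangster rises through the ranks of a crime family",
    "young wizard attends a school of magic",
] * 200
tfidf_matrix = TfidfVectorizer().fit_transform(movie_plots)

# Magnitude of every row vector: square the entries, sum per row, take the root
row_norms = np.sqrt(
    np.asarray(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1)).ravel()
)
print(np.allclose(row_norms, 1.0))

# Time cosine_similarity, which normalizes the vectors before the dot product
start = time.perf_counter()
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_time = time.perf_counter() - start

# Time linear_kernel, which computes the dot products directly
start = time.perf_counter()
linear_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
linear_time = time.perf_counter() - start

print(f"cosine_similarity: {cosine_time:.4f} s, linear_kernel: {linear_time:.4f} s")
# Both approaches should give the same scores for unit-length vectors
print(np.allclose(cosine_sim, linear_sim))
```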
10. The get_recommendations function
The recommender function and the indices series described earlier will be built in the exercises. You can use this function
to generate recommendations
using the cosine similarity matrix.
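To give a feel for how the pieces fit together, here is a rough end-to-end sketch. The 'metadata' dataframe, its titles and its overviews are all invented for the example, and the compact get_recommendations shown here is an assumed stand-in for the version you will write in the exercises.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy dataset standing in for the real movie metadata
metadata = pd.DataFrame({
    "title": ["Crime Saga", "Mob Story", "Wizard School", "Space Voyage"],
    "overview": [
        "mafia boss crime family gangster",
        "gangster crime family betrayal",
        "young wizard magic school",
        "astronaut spaceship distant planet",
    ],
})

# Indices series: movie title -> row position in the dataframe
indices = pd.Series(metadata.index, index=metadata["title"])

# Tf-idf vectors and the pairwise similarity matrix
tfidf_matrix = TfidfVectorizer().fit_transform(metadata["overview"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

def get_recommendations(title, cosine_sim, indices, top_n=10):
    # Look up the movie, rank all movies by similarity, and skip the movie itself
    idx = indices[title]
    ranked = sorted(enumerate(cosine_sim[idx]), key=lambda pair: pair[1], reverse=True)
    positions = [pos for pos, _ in ranked if pos != idx][:top_n]
    return metadata["title"].iloc[positions].tolist()

recs = get_recommendations("Crime Saga", cosine_sim, indices, top_n=1)
print(recs)  # ['Mob Story']
```

Only "Mob Story" shares any words with "Crime Saga" in this toy data, so it is the only movie with a non-zero similarity score.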
11. Let's practice!
In the exercises, you will build recommendation systems of your own and see them in action. Let's practice!