1. Cosine similarity
We now know how to compute vectors
out of text documents. With this representation in mind, let us now explore techniques to determine how similar two vectors, and consequently two documents, are to each other. More specifically, we will learn about the cosine similarity score, one of the most widely used similarity metrics in NLP.
2. Mathematical formula
Very simply put, the cosine similarity score of two vectors is the cosine of the angle between the vectors.
Mathematically, it is the ratio of the dot product of the vectors to the product of the magnitudes of the two vectors.
Let's walk through what this formula really means.
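In symbols, the formula just described can be written as:

```
\cos(\theta) = \frac{V \cdot W}{\lVert V \rVert \, \lVert W \rVert}
```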
3. The dot product
The dot product is computed by summing the product of values across corresponding dimensions of the vectors. Let's say we have two n-dimensional vectors
V and W
as shown. Then, the dot product
here would be v1 times w1
plus v2 times w2 and so on until vn times wn. As an example,
consider two vectors A and B.
By applying the formula above,
we see that the dot product
comes to 37.
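As a quick sketch of this computation, here is the dot product in NumPy. The component values of A and B are not spelled out in the narration, so the vectors below are illustrative values chosen to reproduce the dot product of 37:

```python
import numpy as np

# Illustrative vectors, chosen so the dot product matches the 37 above
A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# Dot product: sum of products across corresponding dimensions
dot = np.dot(A, B)  # 4*5 + 7*2 + 1*3 = 37
print(dot)
```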
4. Magnitude of a vector
The magnitude of a vector is essentially the length of the vector. Mathematically, it is defined as the square root of the sum of the squares of values across all the dimensions of a vector. Therefore, for an n-dimensional vector
V,
the magnitude, mod V, is computed as
the square root of
v1 squared plus v2 squared and so on until vn squared. Consider
the vector A
from before. Using the above formula,
we compute its magnitude
to be root 66.
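This step can be sketched in NumPy as follows. As before, the components of A are illustrative values chosen so that the magnitude works out to root 66:

```python
import numpy as np

# Illustrative vector whose magnitude is sqrt(66), as in the walkthrough
A = np.array([4, 7, 1])

# Magnitude: square root of the sum of squared components
magnitude = np.sqrt(np.sum(A ** 2))  # sqrt(16 + 49 + 1) = sqrt(66)
print(magnitude)
```

NumPy also provides `np.linalg.norm`, which computes the same quantity directly.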
5. The cosine score
We are now in a position to compute the cosine similarity score of
A and B.
It is the dot product,
which is 37, divided by the product of the magnitudes of A and B, which are root 66 and root 38 respectively.
The value comes out
to be approximately 0.738, which is the value of the cosine of the angle theta between the two vectors.
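Putting the two pieces together, the full score can be computed from scratch. The vectors below are illustrative values consistent with the numbers in the walkthrough (dot product 37, magnitudes root 66 and root 38):

```python
import numpy as np

# Illustrative vectors matching the walkthrough's numbers
A = np.array([4, 7, 1])  # magnitude sqrt(66)
B = np.array([5, 2, 3])  # magnitude sqrt(38)

# Cosine similarity: dot product divided by the product of magnitudes
score = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(score, 3))
```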
6. Cosine Score: points to remember
Since the cosine score is simply the cosine of the angle between two vectors, its value is bounded
between -1 and 1. However, in NLP, document vectors almost always use non-negative weights. Therefore, cosine scores vary between 0 and 1
where 0 indicates no similarity and 1 indicates that the documents are identical. Finally, since the cosine score ignores the magnitude of the vectors, it is fairly robust
to document length. This may be an advantage or a disadvantage depending on the use case.
7. Implementation using scikit-learn
Scikit-learn offers a cosine_similarity function that outputs a similarity matrix containing the pairwise cosine scores for a set of vectors. You can import cosine_similarity
from sklearn dot metrics dot pairwise. However, remember that cosine_similarity takes in 2-D arrays as arguments. Passing in 1-D arrays will throw an error. Let us compute the cosine similarity scores
of vectors A and B from before. We get the same answer of 0.738.
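A minimal sketch of this, again using the illustrative vectors from earlier (note the 2-D shapes, as required by cosine_similarity):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 2-D arrays of shape (1, 3); 1-D arrays would raise an error
A = np.array([[4, 7, 1]])
B = np.array([[5, 2, 3]])

# Returns a matrix of pairwise cosine scores, here of shape (1, 1)
sim = cosine_similarity(A, B)
print(sim)
```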
8. Let's practice!
That's enough theory for now. Let's practice!