1. Cosine Similarity
In the previous lesson, we learned about calculating TFIDF weights.
These weights mean very little, however, until we learn how to use them.
2. TFIDF output
Let's consider the following tibble of words, word counts, tf, idf, and tf_idf values.
For article 20, the word January had a TFIDF value of 0.214. We understand how this was calculated: January appeared 4 times in article 20, and January appeared in only a handful of articles overall, so we ended up with this value. So what? Well, we can use these values to assess how similar two articles are, using something called the cosine similarity.
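As a quick illustration of where such a tibble comes from, here is a minimal sketch using tidytext's bind_tf_idf function; the article numbers and word counts below are invented for illustration, not the values from the slide.

    # Sketch: building a tf_idf tibble from word counts.
    # The articles, words, and counts are made up for illustration.
    library(dplyr)
    library(tibble)
    library(tidytext)

    article_words <- tribble(
      ~article, ~word,      ~n,
      20,       "january",   4,
      20,       "market",    2,
      17,       "market",    3,
      16,       "market",    3
    )

    # bind_tf_idf() appends tf, idf, and tf_idf columns to the counts
    article_words %>%
      bind_tf_idf(word, article, n) %>%
      arrange(desc(tf_idf))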
3. Cosine similarity
Cosine similarity is a measure of similarity between two vectors, and is defined as
the cosine of the angle formed between the vectors when they are represented in a multi-dimensional space.
If you represent two texts as vectors in some multi-dimensional space and calculate the cosine of the angle between them, you have created a measure of their similarity.
4. Cosine similarity formula
This can be calculated
by taking the dot product of the two vectors and dividing by the product of their magnitudes.
I don't want to drown you in mathematics here; I just want you to understand that we can calculate how similar two vectors are using their dot product, and that there is a nice, clean formula we can follow to find text similarity.
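In symbols, the standard cosine similarity formula for two vectors A and B is the dot product divided by the product of the magnitudes:

    \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
                 = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}

You can check the arithmetic directly in R; the two vectors below are invented tf_idf values, just to show the computation:

    # Two toy tf_idf vectors (values invented for illustration)
    a <- c(0.2, 0.0, 0.5)
    b <- c(0.1, 0.3, 0.4)

    # Dot product divided by the product of the magnitudes
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    # returns roughly 0.801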
5. Finding similarities part I
Let's review the output
of the bind_tf_idf function. The output is a tibble
of tf_idf values by word and by article. We can use the pairwise_similarity function provided by the widyr R package to calculate the cosine similarity values for each pair of articles.
6. Pairwise similarity
pairwise_similarity requires four arguments (a short sketch follows this list).
A table or tibble of information.
The items you want to compare. These could be articles, tweets, or something else.
The feature of interest; in our case, we look at words.
And the name of the column with the comparison values. We will use the tf_idf column from the previous slide.
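Putting those four arguments together, a call might look like the following sketch; article_tf_idf stands in for the tf_idf tibble from the previous slide (the name is hypothetical).

    library(dplyr)
    library(widyr)

    # Compare articles (items) by their words (features),
    # weighted by the tf_idf column
    article_tf_idf %>%
      pairwise_similarity(article, word, tf_idf)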
7. Finding similarities part II
Using the tf_idf values we calculated earlier and the pairwise_similarity function, we find how similar each article is to every other article.
In this case, articles 17 and 16 are the most similar. Similarity values range between 0 and 1, with 1 meaning the articles are identical and 0 meaning the articles share no distinctive words (words that appear in every article get an idf, and therefore a tf_idf, of zero, so shared common words do not count).
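To surface the most similar pairs first, you can sort the result by the similarity column; continuing the sketch from above:

    # Rank article pairs from most to least similar; the output
    # has columns item1, item2, and similarity
    article_tf_idf %>%
      pairwise_similarity(article, word, tf_idf) %>%
      arrange(desc(similarity))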
8. Cosine similarity use-cases
There are several use cases for cosine similarity. Of course, they all start with the idea that similarity between texts needs to be assessed. You can
find duplicate or similar pieces of text,
as well as use the resulting similarity scores in clustering and classification analysis.
9. Let's practice!
Let's calculate the similarity between pieces of text.