1. Cosine Similarity
In the previous lesson, we learned about calculating TFIDF weights.
These weights mean very little, however, until we learn how to use them.
2. TFIDF output
Let's consider the following tibble of words, word counts, tf, idf, and tf_idf values.
For article 20, the word January had a TFIDF value of 0.214. We understand how this was calculated: January appeared 4 times in article 20, and January appeared in only a handful of articles overall, so we ended up with this value. So what? Well, we can use these values to assess how similar two articles are, using something called the cosine similarity.
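As a quick illustration of where such a tibble comes from, here is a minimal sketch using tidytext's bind_tf_idf function; the article numbers and word counts below are invented for illustration, not the values from the slide.

    # Sketch: building a tf_idf tibble from word counts.
    # The articles, words, and counts are made up for illustration.
    library(dplyr)
    library(tibble)
    library(tidytext)

    article_words <- tribble(
      ~article, ~word,      ~n,
      20,       "january",   4,
      20,       "market",    2,
      17,       "market",    3,
      16,       "market",    3
    )

    # bind_tf_idf() appends tf, idf, and tf_idf columns to the counts
    article_words %>%
      bind_tf_idf(word, article, n) %>%
      arrange(desc(tf_idf))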
3. Cosine similarity
Cosine similarity is a measure of similarity between two vectors, and is defined as
the cosine of the angle formed between the vectors when they are represented in a multi-dimensional space.
If you represent two texts as vectors in some multi-dimensional space and calculate the cosine of the angle between them, you have created a measure of their similarity.
4. Cosine similarity formula
This can be calculated
by taking the dot product of the two vectors and dividing by the product of their magnitudes.
I don't want to drown you in mathematics here; I just want you to understand that we can calculate how similar two vectors are using their dot product, and that there is a nice, clean formula we can follow to find text similarity.
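In symbols, the standard cosine similarity formula for two vectors A and B is the dot product divided by the product of the magnitudes:

    \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
                 = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}

You can check the arithmetic directly in R; the two vectors below are invented tf_idf values, just to show the computation:

    # Two toy tf_idf vectors (values invented for illustration)
    a <- c(0.2, 0.0, 0.5)
    b <- c(0.1, 0.3, 0.4)

    # Dot product divided by the product of the magnitudes
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    # returns roughly 0.801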
5. Finding similarities part I
Let's review the output
of the bind_tf_idf function. The output is a tibble
of tf_idf values by word and by article. We can use the pairwise_similarity function provided by the widyr R package to calculate the cosine similarity values for each pair of articles.
6. Pairwise similarity
pairwise_similarity requires four arguments (a short sketch follows this list).
A table or tibble of information.
The items you want to compare. These could be articles, tweets, or something else.
The feature of interest; in our case, we look at words.
And the name of the column with the comparison values. We will use the tf_idf column from the previous slide.
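Putting those four arguments together, a call might look like the following sketch; article_tf_idf stands in for the tf_idf tibble from the previous slide (the name is hypothetical).

    library(dplyr)
    library(widyr)

    # Compare articles (items) by their words (features),
    # weighted by the tf_idf column
    article_tf_idf %>%
      pairwise_similarity(article, word, tf_idf)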
7. Finding similarities part II
Using the tf_idf values we calculated earlier and the pairwise_similarity function, we find how similar each article is to every other article.
In this case, articles 17 and 16 are the most similar. Similarity values range between 0 and 1, with 1 meaning the articles are identical and 0 meaning the articles share no distinctive words (words that appear in every article get an idf, and therefore a tf_idf, of zero, so shared common words do not count).
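To surface the most similar pairs first, you can sort the result by the similarity column; continuing the sketch from above:

    # Rank article pairs from most to least similar; the output
    # has columns item1, item2, and similarity
    article_tf_idf %>%
      pairwise_similarity(article, word, tf_idf) %>%
      arrange(desc(similarity))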
8. Cosine similarity use-cases
There are several use cases for cosine similarity. Of course, they all start with the idea that similarity between texts needs to be assessed. You can
find duplicate or similar pieces of text,
as well as use the resulting similarity scores in clustering and classification analysis.
9. Let's practice!
Let's calculate the similarity between pieces of text.