
Measuring semantic similarity with spaCy

1. Measuring semantic similarity with spaCy

Welcome! Let's see how to find semantically similar texts using spaCy.

2. The semantic similarity method

Semantic similarity is the process of analyzing multiple sentences to identify similarities between them. Determining semantic similarity can help us categorize texts into predefined categories, detect relevant texts, or flag duplicate content. Suppose we need to find customer questions that are relevant to the word "price". In a list of sentences whose first entry is "What is the cheapest flight from Boston to Seattle?", only the first sentence is related, because it contains the word "cheapest". To measure how similar two pieces of text are, we need to calculate their similarity scores.

3. Similarity score

A semantic similarity score is a metric defined over texts, where the similarity between two texts is measured using their representative word vectors. We will use cosine similarity and word vectors to measure the similarity between two pieces of text. The cosine similarity of two vectors is the cosine of the angle between them. In general it ranges from -1 to 1, although in practice most scores between word vectors fall between 0 and 1. A larger cosine similarity (closer to 1) represents more similar word vectors.
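To make the definition concrete, here is a minimal sketch of cosine similarity in plain Python. The function name and the example vectors are made up for illustration, and returning 0.0 for a zero-length vector is a convention chosen here, not something the definition dictates.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as "not similar"
    return dot / (norm_u * norm_v)


# Illustrative vectors only; real word vectors have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
```

Vectors that point the same way score near 1 regardless of their length, which is why cosine similarity compares direction rather than magnitude.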

4. Token similarity

We can calculate similarity scores between Token objects by using their word vectors. Let's say we want to find out whether the words "pizza" and "pasta" from the sentences "We eat Pizza" and "We like to eat Pasta" are similar, and what their similarity score is. We first create a Doc container for each sentence and extract the tokens for "pizza" and "pasta" by using the index of each word. Then we use the first token's similarity method to calculate the similarity score between pizza and pasta by calling token1.similarity(token2). According to the word vectors, the tokens "pizza" and "pasta" are somewhat similar, and receive a similarity score of 0.685.
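The token comparison can be sketched as follows. To keep the snippet runnable without downloading a model, it assumes a blank English pipeline with hypothetical three-dimensional toy vectors, so the score differs from the one quoted in the transcript. With a pretrained pipeline that ships word vectors (for example one loaded with spacy.load), the set_vector lines would be dropped.

```python
import numpy as np
import spacy

# Blank English pipeline: tokenizer only, no pretrained vectors.
nlp = spacy.blank("en")

# Hypothetical toy vectors, made up for illustration.
nlp.vocab.set_vector("pizza", np.array([1.0, 1.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("pasta", np.array([1.0, 0.0, 0.0], dtype="float32"))

doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

token1 = doc1[2]  # "pizza"
token2 = doc2[4]  # "pasta"

# Cosine similarity between the two token vectors.
score = token1.similarity(token2)
print(f"{token1.text} vs {token2.text}: {score:.3f}")
```

Vectors are looked up by the exact token text, so the toy sentences use lowercase words to match the lowercase keys set above.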

5. Span similarity

Similarly, spaCy can calculate the similarity score of two Spans of text. We previously learned that a Span is a slice of a Doc container: subsetting a Doc container results in a Span object. Like the Token class, the Span class has a span.similarity() method that can be used to calculate the similarity score between two spans. The Span objects "eat pizza" and "eat pasta" have a cosine similarity score of 0.936, so they are much more similar to each other than the spans "eat pizza" and "like to eat pasta", which have a similarity score of 0.588.
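A minimal sketch of the span comparison, again assuming a blank pipeline and hypothetical toy vectors rather than a pretrained model, so the numbers will not match the scores quoted in the transcript. Only the ordering of the two scores is meant to mirror the example.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Hypothetical toy vectors, made up for illustration.
toy_vectors = {
    "eat": [1.0, 0.0, 0.0],
    "like": [0.0, 0.0, 1.0],
    "to": [0.0, 0.0, 1.0],
    "pizza": [0.0, 1.0, 0.2],
    "pasta": [0.0, 1.0, -0.2],
}
for word, values in toy_vectors.items():
    nlp.vocab.set_vector(word, np.array(values, dtype="float32"))

doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")

span1 = doc1[1:3]  # "eat pizza"
span2 = doc2[3:5]  # "eat pasta"
span3 = doc2[1:5]  # "like to eat pasta"

# A span's vector is the average of its token vectors.
short_score = span1.similarity(span2)
long_score = span1.similarity(span3)
print(f"{span1.text} vs {span2.text}: {short_score:.3f}")
print(f"{span1.text} vs {span3.text}: {long_score:.3f}")
```

The extra words in the longer span pull its averaged vector away from "eat pizza", which is why the second score comes out lower.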

6. Doc similarity

We can also determine whether two documents are similar using spaCy. First, we create a Doc container for each document, and then use the first document's .similarity() method to compare it to the second document. The cosine similarity for "I like to play basketball" and "I love to play basketball" is 0.975, which is close to 1. This shows the strength of word vectors in capturing the meanings of words and semantic similarity. By default, a spaCy Doc vector is the average of the word vectors in the document.
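The averaging step can be illustrated without spaCy at all. The sketch below assumes a hypothetical table of two-dimensional word vectors and splits text on whitespace, which is far cruder than a real tokenizer. It shows the mechanism (average the word vectors, then take the cosine), not spaCy's actual implementation.

```python
import math

# Hypothetical 2-dimensional word vectors, made up for illustration.
word_vectors = {
    "i": [0.1, 0.1],
    "like": [0.9, 0.3],
    "love": [1.0, 0.4],
    "to": [0.1, 0.0],
    "play": [0.2, 0.9],
    "basketball": [0.1, 1.0],
}


def doc_vector(text):
    """Average the vectors of the whitespace-separated words in text."""
    vectors = [word_vectors[word] for word in text.lower().split()]
    dims = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec_like = doc_vector("I like to play basketball")
vec_love = doc_vector("I love to play basketball")
score = cosine(vec_like, vec_love)
print(f"document similarity: {score:.3f}")
```

Because the two sentences differ in a single word whose vectors are close, their averaged vectors point in almost the same direction and the score lands near 1.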

7. Sentence similarity

Lastly, we can use spaCy to find sentences that are relevant to a given keyword. For example, given a list of customer questions, we can find the sentence most relevant to a given keyword, such as "price". Like Token, Span, and Doc objects, a spaCy sentence (a Span obtained from a Doc's .sents property) also has a .similarity() method that can be used to compare the sentence with the word vector of a keyword. We observe that the similarity score of the first question, "What is the cheapest flight from Boston to Seattle", is the highest, so it is the most relevant to the keyword "price".
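Ranking sentences against a keyword might look like the sketch below. It assumes a blank pipeline with a rule-based sentencizer and hypothetical toy vectors. The second question is invented for illustration, since the transcript only quotes the first one.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitting so doc.sents works

# Hypothetical toy vectors, made up for illustration.
toy_vectors = {
    "price": [1.0, 0.0],
    "cheapest": [0.9, 0.1],
    "flight": [0.3, 0.7],
    "airline": [0.1, 0.9],
}
for word, values in toy_vectors.items():
    nlp.vocab.set_vector(word, np.array(values, dtype="float32"))

# The second question is a made-up example.
questions = nlp(
    "What is the cheapest flight from Boston to Seattle? "
    "Which airline serves Denver?"
)
keyword = nlp("price")

# Score every sentence against the keyword and keep the best match.
scores = [(sent.text, float(sent.similarity(keyword))) for sent in questions.sents]
best_sentence, best_score = max(scores, key=lambda pair: pair[1])

for text, value in scores:
    print(f"{value:.3f}  {text}")
```

Words without a vector contribute zeros to the sentence average, so they shrink the vector's length but leave its direction, and therefore the cosine score, unchanged.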

8. Let's practice!

Let's practice!