
Word vectors and similarity

1. Word vectors and semantic similarity

In this video, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other. You'll also learn how to use word vectors and how to take advantage of them in your NLP application.

2. Comparing semantic similarity

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens. The Doc, Token and Span objects have a dot-similarity method that takes another object and returns a floating point number, usually between 0 and 1, indicating how similar they are. One thing that's very important: in order to use similarity, you need a larger spaCy model that has word vectors included. For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.

3. Similarity examples (1)

Here's an example. Let's say we want to find out whether two documents are similar. First, we load the medium English model, "en_core_web_md". We can then create two doc objects and use the first doc's similarity method to compare it to the second. Here, a fairly high similarity score of 0-point-86 is predicted for "I like fast food" and "I like pizza". The same works for tokens. According to the word vectors, the tokens "pizza" and "pasta" are kind of similar, and receive a score of 0-point-7.

4. Similarity examples (2)

You can also use the similarity methods to compare different types of objects. For example, a document and a token. Here, the similarity score is pretty low and the two objects are considered fairly dissimilar. Here's another example comparing a span – "pizza and burgers" – to a document. The score returned here is 0-point-61, so the two are considered kind of similar.

5. How does spaCy predict similarity?

But how does spaCy do this under the hood? Similarity is determined using word vectors, multi-dimensional representations of meanings of words. You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text. Vectors can be added to spaCy's statistical models. By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary. Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors. That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
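Both defaults mentioned above – cosine similarity and averaging token vectors – are easy to see in a few lines of NumPy. This is a toy sketch with made-up 2-dimensional vectors standing in for real 300-dimensional word vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # dot product divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A Doc or Span vector defaults to the average of its token vectors
token_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
doc_vector = token_vectors.mean(axis=0)

# The averaged vector points in the same direction as [1, 1],
# so the cosine similarity is (up to rounding) 1.0
print(cosine_similarity(doc_vector, np.array([1.0, 1.0])))
```

This averaging is also why irrelevant tokens dilute the signal: every extra word pulls the average in its own direction, so shorter, focused phrases tend to give more meaningful scores.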

6. Word vectors in spaCy

To give you an idea of what those vectors look like, here's an example. First, we load the medium model again, which ships with word vectors. Next, we can process a text and look up a token's vector using the dot vector attribute. The result is a 300-dimensional vector of the word "banana".

7. Similarity depends on the application context

Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform. However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do. Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

8. Let's practice!

Now it's your turn. Let's try out some of spaCy's word vectors and use them to predict similarities.
