Get Started

Text similarity

1. Text similarity

Let's round out the chapter by looking at how we can compute the similarity between two pieces of text using embeddings!

2. Recap...

Recall that embedding models map semantically similar texts closer together in the vector space. This means we can measure how semantically similar two pieces of text are by computing the distance between their vectors. Being able to measure similarity is what enables the embedding applications we've discussed previously: semantic search, recommendations, and classification.

3. Measuring similarity

There are many different metrics for computing similarity in high-dimensional spaces, but we'll be using cosine distance. Cosine distance comes from linear algebra and evaluates the similarity between two vectors based on the angle between them. Let's start with a small example of two points in two dimensions. To compute the cosine distance between the two points, we can import distance from scipy-dot-spatial and call distance-dot-cosine, passing the coordinates of the two points. The result is one. This is difficult to interpret on its own, but the distance can range from zero to two, where smaller numbers indicate greater similarity. Let's try this out with text embeddings!
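
Here's a minimal sketch of that calculation. The coordinates are illustrative (they weren't given in the narration); these two points happen to be orthogonal, so the distance comes out to one, matching the example above.

```python
from scipy.spatial import distance

# Two illustrative points in two dimensions (orthogonal to each other)
point_a = [1, 0]
point_b = [0, 1]

# distance.cosine returns 1 - cosine similarity, ranging from 0 (most
# similar) to 2 (most dissimilar); smaller values mean greater similarity
print(distance.cosine(point_a, point_b))  # 1.0
```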

4. Example: Comparing headline similarity

Let's return to an earlier dataset of news article information, including headlines and topics, which are stored in a list of dictionaries. The article headlines have already been embedded using OpenAI's embedding model and stored under each article's embedding key. We'll use these embeddings to compare how similar our headlines are to another piece of text, and find the most similar one.
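
As a rough illustration of that structure, here is a hypothetical sketch of the articles list; the headlines, topics, and embedding values are placeholders rather than the course data.

```python
# Hypothetical sketch of the articles data structure; the embedding values
# are truncated placeholders - real embedding vectors have many more dimensions
articles = [
    {"headline": "Tech Giant Unveils Latest Smartphone Model",
     "topic": "Tech",
     "embedding": [0.0023, -0.0118, 0.0094]},
    {"headline": "Local Council Approves New Park Development",
     "topic": "Politics",
     "embedding": [0.0147, 0.0052, -0.0201]},
]
```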

5. Example: Comparing headline similarity

To create embeddings in a more repeatable way, we'll define a custom function that sends a request to the API and extracts and returns the embeddings from the response. This function can be called on a single string or on a list of strings, and it always returns a list of lists. To get back a single, flat list of embeddings in the single-string case, make sure to zero-index the function's result.
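
A hedged sketch of such a helper, assuming the OpenAI Python client (v1.x) with an API key available in the environment; the model name here is an assumption, not necessarily the one used in the course.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def create_embeddings(texts, model="text-embedding-3-small"):
    """Embed a single string or a list of strings; always returns a list of lists."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

# Single-string case: zero-index to get one flat list of floats
query_embedding = create_embeddings("computer")[0]
```

Because the embeddings endpoint accepts either a single string or a list of strings as input, the same function covers both cases.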

6. Example: Comparing headline similarity

First, we'll import distance from scipy-dot-spatial for the cosine distance calculations, and NumPy to access its argmin function, which returns the index of the smallest value in a list. Let's start with a piece of text to compare to our embedded headlines: "computer". We embed this text using our create_embeddings custom function, remembering to zero-index the result. To find the most similar headline to this text, we'll loop over each article, calculating the cosine distance between each embedded headline and the embedded query. We first create an empty list to store our distances, then loop over each article in our articles list of dictionaries. Next, we calculate the cosine distance between the text and the headline by calling distance-dot-cosine, passing it the embedded text and headline. Finally, we append this distance to the distances list. The most similar headline will have the smallest cosine distance, so we can use NumPy's argmin function to return the index of the smallest value in the distances list, then use it to subset the article at this index and return its headline. There we have it! The most similar headline to the text "computer" was one about a new tech product. Pretty neat!
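
Putting these steps together, here is a sketch of the full comparison. It assumes the create_embeddings helper shown earlier and an articles list whose embedding vectors were produced by the same model, so the dimensions match.

```python
import numpy as np
from scipy.spatial import distance

# Embed the query text, zero-indexing to get a single flat list
query_embedding = create_embeddings("computer")[0]

# Cosine distance between the query and each embedded headline
distances = []
for article in articles:
    distances.append(distance.cosine(query_embedding, article["embedding"]))

# The smallest distance marks the most similar headline
min_index = np.argmin(distances)
print(articles[min_index]["headline"])
```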

7. Let's practice!

And now, it's your turn!