

1. Semantic search and enriched embeddings

In this chapter, we'll begin to apply what you've already learned to implement the most popular embedding applications: semantic search, recommendation systems, and classification tasks. Let's start with semantic search.

2. Semantic search

Recall that semantic search engines use embeddings to return the results most semantically similar to a search query. For example, a news website could enable semantic search by embedding news article information like the headline and topic. A user searching for "computer" would then be shown a selection of computer-related headlines.

3. Semantic search

There are three steps to semantic search: embed the search query and texts to compare against, compute the cosine distances between the embedded search query and other embedded texts, and finally, extract the texts with the smallest cosine distance. Let's implement this semantic article search using the OpenAI API and Python.

4. Enriched embeddings

Here's the headlines data we'll be working with. We'll embed not only the headline text, but also the topic and keywords. To do this, we'll combine the information from each article into a single string that reflects the structure of the dictionary, with the keywords delimited by a comma and a space.
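The full dataset isn't shown in the transcript, but based on the description, each article might be stored as a dictionary like the following (the headlines, topics, and keywords here are hypothetical examples, not the course data):

```python
# Hypothetical sample of the articles data: each article is a dictionary
# with a headline, a topic, and a list of keywords.
articles = [
    {"headline": "Economic Growth Continues Amid Global Uncertainty",
     "topic": "Business",
     "keywords": ["economy", "growth", "global"]},
    {"headline": "Tech Giant Buys AI Startup",
     "topic": "Tech",
     "keywords": ["AI", "acquisition", "startup"]},
]

# Note that the keywords are stored as a list, not a string
print(articles[1]["keywords"])
```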

5. Combining features with F-strings

To combine these features for each article, we'll define a function called create_article_text. This function uses an f-string, or formatted string, to return the desired string structure. F-strings let us insert variables into strings without having to convert and concatenate them manually. An f-string is created by placing an f before the quotes, and note that we've defined a multi-line string using triple quotes. To insert an object, we use curly brackets containing the variable or other Python code to insert. The article headline and topic are extracted using their keys and inserted into the string at the desired locations. The keywords are a little trickier because they are stored as a list rather than a string. To convert the keywords list into a string, we use the string join method, which joins the contents of a list into a single string. The method is called on the delimiter string we want between each keyword, in this case, a comma and a space. Calling the function on the final article shows the text in the desired format.
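Put together, the function might look like this. The exact string template and the key names headline, topic, and keywords are assumptions based on the description, and the example article is hypothetical:

```python
def create_article_text(article):
    """Combine an article's features into a single string using an f-string."""
    # Triple quotes define a multi-line string; ', '.join(...) converts
    # the keywords list into a comma-and-space-delimited string.
    return f"""Headline: {article['headline']}
Topic: {article['topic']}
Keywords: {', '.join(article['keywords'])}"""

# Calling the function on a hypothetical article
article = {"headline": "Tech Giant Buys AI Startup",
           "topic": "Tech",
           "keywords": ["AI", "acquisition", "startup"]}
print(create_article_text(article))
```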

6. Creating enriched embeddings

To apply the function and combine the features for each article, we use a list comprehension, calling our function on each article in articles. Finally, to embed these strings, we call the create_embeddings function on the result. Recall that this creates a list of embeddings, one for each input, using the OpenAI API. Now that we have our embeddings, it's time to compute cosine distances.
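This step might be sketched as follows. The list comprehension runs as-is; the embedding call needs an OpenAI API key, so it is shown commented out, and the model name "text-embedding-3-small" is an assumption (the course's create_embeddings may differ):

```python
def create_article_text(article):
    return f"""Headline: {article['headline']}
Topic: {article['topic']}
Keywords: {', '.join(article['keywords'])}"""

# Hypothetical sample data
articles = [
    {"headline": "Economic Growth Continues", "topic": "Business",
     "keywords": ["economy", "growth"]},
    {"headline": "Tech Giant Buys AI Startup", "topic": "Tech",
     "keywords": ["AI", "acquisition"]},
]

# Combine the features for each article with a list comprehension
article_texts = [create_article_text(article) for article in articles]
print(len(article_texts))

# Embedding the combined strings (requires an OpenAI API key):
# from openai import OpenAI
# client = OpenAI()
# def create_embeddings(texts):
#     response = client.embeddings.create(model="text-embedding-3-small",
#                                         input=texts)
#     return [item.embedding for item in response.data]
# article_embeddings = create_embeddings(article_texts)
```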

7. Computing distances

We'll define a function called find_n_closest that takes a query_vector, the embedded search query, and embeddings to compare against, our embedded articles, and returns the n most similar results based on their cosine distances. For each embedding, we calculate the cosine distance to the query_vector and store it in a dictionary along with the embedding's index, appending each dictionary to a list called distances. To sort the distances list by the distance key in each dictionary, we use the sorted function and its key argument. The key argument takes a function used to evaluate each dictionary in distances and sort by; here, it's a lambda function that accesses the distance key from each dictionary. Finally, the function returns the n closest results.
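A sketch of find_n_closest, with the cosine distance computed from scratch so the example runs without extra dependencies (in practice this step would typically use scipy.spatial.distance.cosine):

```python
from math import sqrt

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

def find_n_closest(query_vector, embeddings, n=3):
    """Return the n embeddings closest to query_vector by cosine distance."""
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = cosine_distance(query_vector, embedding)
        # Store the distance alongside the embedding's index
        distances.append({"distance": dist, "index": index})
    # Sort the dictionaries by their "distance" key, smallest first
    distances_sorted = sorted(distances, key=lambda d: d["distance"])
    return distances_sorted[:n]
```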

8. Returning the search results

Time to bring all the semantic search pieces together! We'll query our embeddings using the text, "AI". First, we embed the search query using our create_embeddings function and extract its embedding by zero-indexing the result. Next, we use the find_n_closest function to find the three closest hits based on our article_embeddings. Finally, to extract the most similar headlines, we loop through each hit, use the hit's index to subset the corresponding headline, and print it. As we'd expect, the top result specifically mentions AI, and the others cover similar topics.
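The full pipeline might be sketched as follows. Tiny made-up two-dimensional vectors stand in for real API embeddings so the example runs without an API key; the headlines and vector values are hypothetical:

```python
from math import sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def find_n_closest(query_vector, embeddings, n=3):
    distances = [{"distance": cosine_distance(query_vector, e), "index": i}
                 for i, e in enumerate(embeddings)]
    return sorted(distances, key=lambda d: d["distance"])[:n]

headlines = ["Tech Giant Buys AI Startup",
             "Economic Growth Continues",
             "New AI Model Released"]

# In the real pipeline, both are produced by the OpenAI API:
# query_vector = create_embeddings(["AI"])[0]
# Made-up stand-in vectors for illustration:
article_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
query_vector = [1.0, 0.0]

hits = find_n_closest(query_vector, article_embeddings, n=3)
for hit in hits:
    # Use each hit's index to subset the corresponding headline
    print(headlines[hit["index"]])
```

With these toy vectors, the two AI-related headlines rank ahead of the business one, mirroring the behaviour described above.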

9. Let's practice!

Now it's your turn!