
Investigating the vector space

1. Investigating the vector space

Welcome back! In this video, we'll discuss embedding multiple inputs, and how to store, handle, and explore these embeddings. Let's dive in!

2. Example: Embedding headlines

We'll be working with a dataset of news articles stored in a list of dictionaries. Each article has a headline stored under the headline key and a topic stored under the topic key.
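As a rough sketch, the dataset might look something like this (the specific headlines and topics below are made up for illustration, not taken from the course dataset):

```python
# Hypothetical articles list: each dictionary has a "headline" and a "topic" key
articles = [
    {"headline": "Stock markets rally after surprise rate cut", "topic": "Business"},
    {"headline": "New vaccine shows promise in early trials", "topic": "Health"},
    {"headline": "Local team wins championship final", "topic": "Sport"},
]
```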

3. Example: Embedding headlines

We'll embed each headline's text and add the result back into each article's dictionary, stored under the embedding key.

4. Embedding multiple inputs

To start, we'll extract each article's headline using a list comprehension, accessing the headline key from each dictionary. To compute the embeddings, we can pass this entire list as the input to the create method. Batching the embeddings in this way is much more efficient than making a separate API call for each input.
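A minimal sketch of this batched call, assuming the openai Python client and an embedding model such as text-embedding-ada-002 (the model name and client setup are assumptions, not necessarily what the course uses):

```python
from openai import OpenAI

client = OpenAI()  # assumes the API key is available, e.g. via OPENAI_API_KEY

# Extract every headline into a single list
headlines = [article["headline"] for article in articles]

# Pass the whole list to the create method to embed all headlines in one request
response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model name
    input=headlines
)
```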

5. Response

The response output differs from the single-input case in only one way: where before the list under the data key contained a single dictionary holding the embeddings, with multiple inputs it contains one dictionary per input.
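Continuing the earlier sketch, each item in response.data holds the embedding for the input at the same index:

```python
# One entry in response.data per input headline
print(len(response.data))               # equals len(headlines)

# Each entry carries the embedding for the input at the same index
print(response.data[0].embedding[:5])   # first few numbers of the first headline's embedding
```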

6. Embedding multiple inputs

To extract these embeddings from the response and store them in the articles list of dictionaries, we loop over the indexes and articles using enumerate. For each article, we assign the embedding at the same index in the response to the article's embedding key. Let's print the first two articles. Voilà! We've successfully created embeddings for multiple inputs! Let's investigate these numbers more closely.
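A short sketch of that loop, building on the hypothetical articles list and response from above:

```python
# Store each embedding back in its article, under the "embedding" key
for index, article in enumerate(articles):
    article["embedding"] = response.data[index].embedding

# Inspect the first two articles
print(articles[:2])
```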

7. How long is the embeddings vector?

For the first article in the list, the embedding model returned 1536 numbers representing the semantic meaning of its headline, or in other words, its position, or vector, in the vector space. Let's take a look at another, longer headline. We get 1536 numbers again! This is a key property of OpenAI's embedding models: each model returns a fixed number of dimensions, so this model always returns 1536 numbers, no matter the input.
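Checking this on the sketched articles list (the 1536 in the comments assumes the model above):

```python
# The vector length is fixed by the model, regardless of how long the headline is
print(len(articles[0]["embedding"]))  # 1536
print(len(articles[1]["embedding"]))  # 1536, even for a longer headline
```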

8. Dimensionality reduction and t-SNE

Let's visualize our embeddings to better understand the model's results. We'll first need to reduce the number of dimensions from 1536 to something more manageable, like 2. There are lots of techniques for performing dimensionality reduction, but we'll be using t-SNE, or t-distributed Stochastic Neighbor Embedding. Although we won't be going into the mechanics of how t-SNE works, you can check out the DataCamp link to learn more.

9. Implementing t-SNE

We'll implement t-SNE using scikit-learn, a popular Python package for machine learning tasks. First, we import TSNE from sklearn-dot-manifold and numpy as np. Next, we'll extract the embeddings from our articles list of dictionaries using a list comprehension. To implement t-SNE, we create a TSNE instance and assign it to the tsne variable. We've specified two arguments: n_components, the number of dimensions we want to reduce to, two, and perplexity, which is used by the algorithm in the transformation. The default value of 30 is normally fine, but for smaller datasets, it must be reduced to a number less than the number of data points. We have 10 articles in our dataset, so we've reduced perplexity to five. Finally, to perform the t-SNE transformation, we call the fit_transform method on the tsne object, passing it the embeddings as a NumPy array. This will return the transformed embeddings in a NumPy array with n_components dimensions, which we can now visualize. Although t-SNE is useful for exploring and visualizing higher-dimensional data, some information is lost in the transformation, so it should be used with caution.
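Here's a minimal sketch of those steps with scikit-learn, assuming the articles list from earlier already contains the embeddings:

```python
from sklearn.manifold import TSNE
import numpy as np

# Collect the embeddings from the articles list of dictionaries
embeddings = [article["embedding"] for article in articles]

# Reduce to 2 dimensions; perplexity must be below the number of data points,
# so it is lowered to 5 for a small (10-article) dataset
tsne = TSNE(n_components=2, perplexity=5)
embeddings_2d = tsne.fit_transform(np.array(embeddings))
```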

10. Visualizing the embeddings

To visualize these transformed embeddings, we call plt-dot-scatter from Matplotlib on the first and second columns of the embeddings_2d array. We'll also include some code to extract the article topics, annotate the plot with them, and display the plot.
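A sketch of that plotting code, reusing the embeddings_2d array and the articles list from the previous steps:

```python
import matplotlib.pyplot as plt

# Scatter the first and second t-SNE components
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Label each point with its article's topic
topics = [article["topic"] for article in articles]
for i, topic in enumerate(topics):
    plt.annotate(topic, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.show()
```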

11. Visualizing the embeddings

Here's the plot. Notice that headlines with the same topic were clustered more closely together! In other words, the model captured the semantic meaning of the headlines and mapped them based on it! In the next video, we'll discuss how to compute the similarity between embeddings to enable applications like semantic search.

12. Let's practice!

Time to practice!