1. Creating vector databases with ChromaDB

Time to start getting hands-on with ChromaDB!

2. Installing ChromaDB

Chroma comes in two flavors: a local mode, where everything happens inside Python, and a client/server mode, where a ChromaDB server runs as a separate process and which is better suited for production systems. We'll only be looking at local mode, which is the simplest way to run Chroma and is well-suited for development and prototyping.
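
A minimal setup sketch: Chroma is installed from PyPI, and the openai and tiktoken packages used later in this chapter can be installed alongside it if you don't already have them.

```shell
pip install chromadb openai tiktoken
```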

3. Connecting to the database

To connect to and query the database, we first need to create a client. We import chromadb and create a persistent client by calling PersistentClient. Persistent clients save the database files to disk at the specified path.
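
For reference, here is a minimal sketch of creating a persistent client; the path "chroma_db" is an arbitrary example, and Chroma will create the directory if it doesn't exist.

```python
import chromadb

# Create a client that persists the database files to disk at the given path
client = chromadb.PersistentClient(path="chroma_db")
```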

4. Creating a collection

To add embeddings to the database, we must first create a collection. Collections are analogous to tables in a relational database, and we can create as many as we need to store our data. To create one, we use the .create_collection() method, passing the name of our collection, which is used as a reference, and the function for creating the embeddings; here, we specify the OpenAI embedding function and API key. In Chroma, as in many other vector databases, a default embedding function is used automatically if one isn't specified.
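
As a sketch, creating such a collection might look like this; the collection name and embedding model are example choices, and the API key placeholder should be replaced with your own key.

```python
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Create a collection that embeds documents with an OpenAI embedding model
collection = client.create_collection(
    name="netflix_titles",
    embedding_function=OpenAIEmbeddingFunction(
        api_key="<OPENAI_API_KEY>",          # replace with your own key
        model_name="text-embedding-3-small"  # example model choice
    )
)
```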

5. Inspecting collections

The .list_collections() method lists all of the collections in the database, so we can verify that our collection was created successfully.
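
For example:

```python
# Verify that the collection was created
print(client.list_collections())
```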

6. Inserting embeddings

We are now ready to add embeddings to the collection with the .add() method. In this example, we're adding a single document. Chroma will not automatically generate IDs for these documents, so they must be specified. Since the collection already knows its embedding function, it will embed the source texts automatically using the function we specified. Most of the time, we'll insert multiple documents at once, which we can do by passing multiple IDs and documents.
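
A sketch of both cases, using made-up IDs and documents:

```python
# Add a single document; the collection embeds the text automatically
collection.add(ids=["s1"], documents=["An example source text"])

# Add multiple documents at once by passing lists of IDs and documents
collection.add(
    ids=["s2", "s3"],
    documents=["Another example source text", "A third example source text"]
)
```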

7. Inspecting a collection

After inserting documents, we can inspect the collection with two methods: collection.count() will return the total number of documents in the collection.
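
For example:

```python
# Total number of documents stored in the collection
print(collection.count())
```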

8. Inspecting a collection

And collection.peek() will return the first ten items in the collection. As we can see, the embeddings were created automatically when we inserted the texts.
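
For example:

```python
# Preview the first ten items, including their automatically created embeddings
print(collection.peek())
```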

9. Retrieving items

We can also retrieve particular items by their ID using the .get() method.
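
For example, using the made-up IDs from earlier:

```python
# Retrieve specific items by ID
print(collection.get(ids=["s1", "s2"]))
```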

10. Netflix dataset

In the following exercises, we'll insert a dataset of Netflix titles into a Chroma database. For each title, we'll embed a source text containing the title, description, and categories. While this is not a massive dataset, remember that each of these texts will be sent to the OpenAI embedding endpoint, which costs money. Before inserting a sizable dataset into a collection, it's important to get an idea of the cost.

11. Estimating embedding cost

OpenAI provides the cost per thousand tokens on their model pricing page, which means we can find the total cost by multiplying this value by the number of tokens in the texts we'll embed and dividing by a thousand. We can count these tokens with OpenAI's tiktoken library.

12. Estimating embedding cost

tiktoken can convert any text into tokens. First, we use the encoding_for_model function to get a token encoder for the embedding model we're using. To calculate the total number of tokens, we use a generator expression: for each text in documents, encode it with the encoder and take the length to get the number of tokens in that text, then sum the results. This is more concise and efficient than an explicit loop; it just takes a little getting used to. Finally, we calculate the price by multiplying total_tokens by cost_per_1k_tokens over 1000, and print the result. In the exercises, we'll work with a smaller subset of the first 1,000 titles.
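
Putting this together, a sketch of the calculation might look as follows; the documents, model name, and price here are example values, so check OpenAI's pricing page for current figures and use the model you actually embed with.

```python
import tiktoken

# Example source texts; in the exercises these would be the Netflix texts
documents = ["An example source text", "Another example source text"]

# Get a token encoder for the embedding model we're using (example model)
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# For each text, count its tokens, then sum the counts
total_tokens = sum(len(enc.encode(text)) for text in documents)

cost_per_1k_tokens = 0.00002  # example price in dollars per 1,000 tokens

print("Total tokens:", total_tokens)
print("Estimated cost ($):", cost_per_1k_tokens * total_tokens / 1000)
```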

13. Let's practice!

Now it's your turn!