Estimating embedding costs with tiktoken

Now that we've created a database and collection to store the Netflix films and TV shows, we can begin embedding data.

Before embedding a large dataset, it's important to estimate the cost so that you stay within budget. Because OpenAI embedding models are priced by the number of input tokens, we'll use OpenAI's tiktoken library to count the tokens and convert that count into a dollar cost.
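
The conversion from tokens to dollars is simple arithmetic: divide the token count by 1,000 and multiply by the price per 1,000 tokens. Here is a minimal sketch of that step. The function name `estimate_cost` and the example token count are hypothetical, chosen only for illustration; the rate matches the `cost_per_1k_tokens` value used in this exercise.

```python
def estimate_cost(total_tokens: int, cost_per_1k_tokens: float) -> float:
    """Convert a token count into a dollar cost, given a price per 1,000 tokens."""
    return total_tokens / 1000 * cost_per_1k_tokens


# Hypothetical token count, used only to illustrate the arithmetic
example_tokens = 500_000
cost_per_1k_tokens = 0.00002  # same rate as in the exercise code

print(f"Cost: ${estimate_cost(example_tokens, cost_per_1k_tokens):.4f}")
# prints "Cost: $0.0100"
```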

You've been provided with documents, a list containing all of the texts to embed. You'll iterate over the list, encode each document, and count the total number of tokens. Finally, you'll use the model's price per 1,000 tokens to convert this total into a cost.
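
The counting step can be sketched independently of any particular tokenizer. In the sketch below, `count_total_tokens`, `whitespace_encode`, and `sample_documents` are hypothetical names introduced for illustration. The whitespace splitter is only a stand-in so the example runs without extra dependencies; it is not how OpenAI models tokenize text, and real token counts will differ.

```python
from typing import Callable, Iterable, List, Sequence


def count_total_tokens(
    documents: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> int:
    """Encode each document and sum the number of tokens across all of them."""
    return sum(len(encode(text)) for text in documents)


def whitespace_encode(text: str) -> List[int]:
    """Stand-in encoder (NOT a real tokenizer): one integer per whitespace-separated word."""
    return list(range(len(text.split())))


# Hypothetical sample data standing in for the exercise's documents list
sample_documents = ["A film about space travel", "A TV show about cooking"]

total_tokens = count_total_tokens(sample_documents, whitespace_encode)
print("Total tokens:", total_tokens)
```

With tiktoken, the same function should work by passing the loaded encoder's `encode` method as the second argument, since it takes a string and returns a list of integer token IDs.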

This exercise is part of the course

“Introduction to Embeddings with the OpenAI API”

Exercise instructions

  • Load the encoder for the text-embedding-3-small model.
  • Encode each text in documents, and sum the lengths of the encodings to find the total number of tokens in the dataset, total_tokens.
  • Print the total number of tokens and the cost of those tokens using the model's cost_per_1k_tokens defined for you.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

import tiktoken

# Load the encoder for the OpenAI text-embedding-3-small model
enc = tiktoken.encoding_for_model("____")

# Encode each text in documents and calculate the total tokens
total_tokens = ____(____(____) for ____ in documents)

# Price in US dollars per 1,000 tokens
cost_per_1k_tokens = 0.00002

# Display number of tokens and cost
print('Total tokens:', ____)
print('Cost:', ____)