Estimating embedding costs with tiktoken
Now that we've created a database and collection to store the Netflix films and TV shows, we can begin embedding data.
Before embedding a large dataset, it's important to estimate the cost so that you stay within budget. Because OpenAI embedding models are priced by the number of input tokens, we'll use OpenAI's tiktoken library to count the tokens and convert that count into a dollar cost.
You've been provided with documents, which is a list containing all of the data to embed. You'll iterate over the list, encode each document, and count the total number of tokens. Finally, you'll use the model's pricing to convert this into a cost.
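To make the final conversion step concrete, here is a minimal sketch of the arithmetic, assuming the price is quoted per 1,000 tokens. The helper name tokens_to_cost and the token count of 500,000 are hypothetical and chosen only for illustration; the price is the cost_per_1k_tokens value used in this exercise.

```python
def tokens_to_cost(total_tokens: int, cost_per_1k_tokens: float) -> float:
    """Convert a token count into a dollar cost at a per-1,000-token price."""
    return total_tokens / 1000 * cost_per_1k_tokens


# Hypothetical token count, for illustration only
example_cost = tokens_to_cost(500_000, 0.00002)
print(f"Cost: ${example_cost:.4f}")
```

Dividing by 1,000 first keeps the units explicit: the token count becomes a number of 1,000-token blocks, which is then multiplied by the price per block.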
This is a part of the course “Introduction to Embeddings with the OpenAI API”.
Exercise instructions
- Load the encoder for the text-embedding-3-small model.
- Encode each text in documents, and sum the result to find the total number of tokens in the dataset, total_tokens.
- Print the total number of tokens and the cost of those tokens using the model's cost_per_1k_tokens defined for you.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the encoder for the OpenAI text-embedding-3-small model
enc = tiktoken.encoding_for_model("____")
# Encode each text in documents and calculate the total tokens
total_tokens = ____(____(____) for ____ in documents)
cost_per_1k_tokens = 0.00002
# Display number of tokens and cost
print('Total tokens:', ____)
print('Cost:', ____)
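One possible way to approach the blanks above is sketched below; it is not necessarily the course's reference solution. It assumes tiktoken is installed and able to load its encoding files. The helper names load_encoder, count_tokens, and report are hypothetical, introduced only for this sketch, and documents is expected to come from the exercise environment.

```python
from typing import Callable, Iterable, List, Tuple


def load_encoder(model_name: str = "text-embedding-3-small"):
    """Return the tiktoken encoder for an OpenAI model (imported lazily)."""
    import tiktoken  # third-party package: pip install tiktoken
    return tiktoken.encoding_for_model(model_name)


def count_tokens(texts: Iterable[str], encode: Callable[[str], List]) -> int:
    """Encode each text and sum the lengths of the resulting token lists."""
    return sum(len(encode(text)) for text in texts)


def report(documents: Iterable[str],
           encode: Callable[[str], List],
           cost_per_1k_tokens: float = 0.00002) -> Tuple[int, float]:
    """Print and return the total token count and its estimated cost."""
    total_tokens = count_tokens(documents, encode)
    cost = total_tokens / 1000 * cost_per_1k_tokens
    print('Total tokens:', total_tokens)
    print('Cost:', cost)
    return total_tokens, cost
```

With tiktoken available, calling report(documents, load_encoder().encode) would print the totals for the dataset. Passing the encode function as an argument also means the counting logic can be checked with any stand-in tokenizer before running it on the real data.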
This exercise is part of the course
Introduction to Embeddings with the OpenAI API
Unlock more advanced AI applications, like semantic search and recommendation engines, using OpenAI's embedding model!
To enable embedding applications in production, you'll need an efficient vector storage and querying solution: enter vector databases! You'll learn how vector databases can help scale embedding applications and begin creating and adding to your very own vector databases using Chroma.