
Splitting by tokens

Splitting documents with RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can perform well in some cases, but these splitters have one drawback: they measure chunks in characters rather than tokens, which are the units the model actually processes.
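To see why this matters, here is a minimal sketch of the mismatch: fixed-size character chunks contain varying numbers of tokens. The `rough_token_count` helper below is a crude whitespace stand-in invented for illustration; a real tokenizer such as tiktoken produces subword tokens, making the mismatch even less predictable.

```python
# Sketch: why character-based chunk sizes don't map cleanly onto tokens.

def char_chunks(text, size):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def rough_token_count(text):
    # Stand-in tokenizer: counts whitespace-separated words.
    # Real BPE tokenizers (e.g. tiktoken) split into subword tokens.
    return len(text.split())

text = ("Retrieval Augmented Generation pipelines feed chunks to a model, "
        "so chunk limits should be measured in tokens, not characters. ") * 4

# Every chunk has at most 120 characters, yet the token counts differ.
for chunk in char_chunks(text, 120):
    print(f"{len(chunk)} chars -> {rough_token_count(chunk)} tokens")
```

A chunk size expressed in characters therefore gives no hard guarantee about token counts, which is exactly what a context-window limit is defined in.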

In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk to ensure that they don't exceed the model's context window. A PDF document has been loaded as document.

tiktoken and all necessary classes have been imported for you.

This exercise is part of the course

Retrieval Augmented Generation (RAG) with LangChain

Exercise instructions

  • Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
  • Create a text splitter to split based on the number of tokens using the GPT-4o-Mini encoding.
  • Split the PDF, stored in document, into chunks using token_splitter.

Interactive exercise

Complete the sample code to finish this exercise.

# Get the encoding for gpt-4o-mini
encoding = ____

# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)

# Split the PDF into chunks
chunks = ____

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")