
Splitting by tokens

Splitting documents with RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can give good results in some cases, but these splitters share one drawback: they split using characters as their base unit rather than tokens, and tokens are what the model actually processes.
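
To see why this matters, compare the two counts directly. The following sketch (assuming a tiktoken version that recognizes the gpt-4o-mini model name) shows that character length is a poor proxy for token length:

import tiktoken

# Look up the tokenizer that gpt-4o-mini uses
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

text = "Splitting on characters ignores how the model tokenizes text."
print(len(text))                   # number of characters
print(len(encoding.encode(text)))  # number of tokens, typically far fewer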

In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk to ensure that they don't exceed the model's context window. A PDF document has been loaded as document.

tiktoken and all necessary classes have been imported for you.
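
For reference, a typical import block for this setup might look as follows; the langchain_text_splitters module path is an assumption based on current LangChain packaging:

import tiktoken
from langchain_text_splitters import TokenTextSplitter  # assumed import path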

This exercise is part of the course

Retrieval Augmented Generation (RAG) with LangChain


Exercise instructions

  • Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
  • Create a text splitter to split based on the number of tokens using the GPT-4o-Mini encoding.
  • Split the PDF, stored in document, into chunks using token_splitter.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Get the encoding for gpt-4o-mini
encoding = ____

# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)

# Split the PDF into chunks
chunks = ____

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")