
Splitting by tokens

Splitting documents using RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can give good performance in some cases, but both splitters share one drawback: they split using characters as base units, rather than tokens, which are the units the model actually processes.
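
To see the difference, compare a string's character length with its token length under a model's encoding. The snippet below is a minimal sketch, assuming tiktoken is installed; the sentence is only illustrative:

import tiktoken

# Character count and token count can differ substantially
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
text = "Retrieval Augmented Generation combines search with text generation."

print(len(text))                   # length in characters
print(len(encoding.encode(text)))  # length in tokens, as the model sees it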

In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk and ensure that no chunk exceeds the model's context window. A PDF document has been loaded as document.

tiktoken and all necessary classes have been imported for you.

This exercise is part of the course Retrieval Augmented Generation (RAG) with LangChain.

Exercise instructions

  • Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
  • Create a text splitter to split based on the number of tokens, using the gpt-4o-mini encoding.
  • Split the PDF, stored in document, into chunks using token_splitter.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Get the encoding for gpt-4o-mini
encoding = ____

# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)

# Split the PDF into chunks
chunks = ____

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")
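Once you've attempted the exercise, you can compare against one possible completion. The version below is a sketch, assuming tiktoken and LangChain's TokenTextSplitter are the classes the course has imported for you (the import path shown is an assumption), and that document holds the loaded PDF pages:

import tiktoken
from langchain_text_splitters import TokenTextSplitter  # import path assumed

# Get the encoding for gpt-4o-mini
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

# Create a token text splitter that counts tokens with the same encoding
token_splitter = TokenTextSplitter(encoding_name=encoding.name, chunk_size=100, chunk_overlap=10)

# Split the PDF into chunks of at most 100 tokens, overlapping by 10
chunks = token_splitter.split_documents(document)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")

Note that chunk_size and chunk_overlap here are measured in tokens, not characters, so the printed token counts should stay at or below 100.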