Splitting by tokens
Splitting documents using RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can give you good performance in some cases, but these splitters share one drawback: they split using characters as the base unit, rather than the tokens that the model actually processes.
In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk and ensure that no chunk exceeds the model's context window. A PDF document has been loaded as `document`, and `tiktoken` and all necessary classes have been imported for you.
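To see why the distinction matters, here's a quick sketch (not part of the exercise) comparing character and token counts for the same string. It assumes a recent tiktoken release that recognizes the gpt-4o-mini model name:

```python
import tiktoken

# Load the tokenizer used by gpt-4o-mini
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

text = "Retrieval Augmented Generation combines search with generation."
print(len(text))                   # character count
print(len(encoding.encode(text)))  # token count, typically much smaller
```

Because a chunk's character count is only a loose proxy for its token count, sizing chunks in characters can't guarantee they fit a token-based context window; splitting by tokens can.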
Exercise instructions
- Get the encoding for `gpt-4o-mini` from `tiktoken` so you can check the number of tokens in each chunk.
- Create a text splitter to split based on the number of tokens using the GPT-4o-Mini encoding.
- Split the PDF, stored in `document`, into chunks using `token_splitter`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Get the encoding for gpt-4o-mini
encoding = ____
# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)
# Split the PDF into chunks
chunks = ____
# Print the first three chunks with their token counts
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")