Splitting by tokens
Splitting documents using RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can perform well in many cases, but these splitters have one drawback: they split using characters as their base unit, rather than the tokens that the model actually processes.
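Character counts are only a rough proxy for token counts. As a quick illustration (a sketch assuming tiktoken is installed; the example string is arbitrary):

import tiktoken

# Compare the character count with the token count for the same string
text = "LangChain splits documents so they fit a model's context window."
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

print(f"Characters: {len(text)}")               # character count
print(f"Tokens: {len(encoding.encode(text))}")  # token count, typically much smaller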
In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk and ensure that none exceeds the model's context window. A PDF document has been loaded as document, and tiktoken and all necessary classes have been imported for you.
Exercise instructions
- Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
- Create a text splitter that splits based on the number of tokens, using the gpt-4o-mini encoding.
- Split the PDF, stored in document, into chunks using token_splitter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Get the encoding for gpt-4o-mini
encoding = ____
# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)
# Split the PDF into chunks
chunks = ____
# Print the first three chunks with their token counts
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")