Splitting by tokens
Splitting documents with RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can perform well in some cases, but these splitters share one drawback: they split on characters as base units rather than on tokens, which are the units the model actually processes.
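To see why this distinction matters, you can count tokens directly with tiktoken and compare the result to the character count. The snippet below is a minimal sketch; the sample sentence is illustrative, and it assumes a tiktoken release recent enough for encoding_for_model to recognize gpt-4o-mini.

import tiktoken

# Compare character count and token count for the same string
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
text = "Retrieval Augmented Generation grounds model answers in your documents."
print(f"Characters: {len(text)}, Tokens: {len(encoding.encode(text))}")

The two numbers differ because a single token typically covers several characters, which is why a chunk size measured in characters is only a rough proxy for what the model sees.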
In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk and ensure that no chunk exceeds the model's context window. A PDF document has been loaded as document.
tiktoken and all necessary classes have been imported for you.
This exercise is part of the course
Retrieval Augmented Generation (RAG) with LangChain
Instructions
- Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
- Create a text splitter to split based on the number of tokens using the GPT-4o-Mini encoding.
- Split the PDF, stored in document, into chunks using token_splitter.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Get the encoding for gpt-4o-mini
encoding = ____
# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)
# Split the PDF into chunks
chunks = ____
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")