
Splitting by tokens

Splitting documents with RecursiveCharacterTextSplitter or CharacterTextSplitter is convenient and can perform well in many cases, but it has one drawback: these splitters use characters as their base unit, rather than the tokens the model actually processes.
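To see why this matters, compare character and token counts for the same string. This is a minimal illustration, assuming tiktoken is installed; o200k_base is the encoding gpt-4o-mini uses:

import tiktoken

# o200k_base is the encoding used by gpt-4o-mini
encoding = tiktoken.get_encoding("o200k_base")

text = "Splitting by characters is not the same as splitting by tokens."
print(len(text))                   # number of characters
print(len(encoding.encode(text)))  # number of tokens, typically far fewer

A chunk of a fixed number of characters can contain very different numbers of tokens depending on the text, which is why counting tokens directly is safer when targeting a model's context window.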

In this exercise, you'll split documents using a token text splitter, so you can verify the number of tokens in each chunk to ensure that they don't exceed the model's context window. A PDF document has been loaded as document.

tiktoken and all necessary classes have been imported for you.

This exercise is part of the course

Retrieval Augmented Generation (RAG) with LangChain

Exercise instructions

  • Get the encoding for gpt-4o-mini from tiktoken so you can check the number of tokens in each chunk.
  • Create a text splitter to split based on the number of tokens using the GPT-4o-Mini encoding.
  • Split the PDF, stored in document, into chunks using token_splitter.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Get the encoding for gpt-4o-mini
encoding = ____

# Create a token text splitter
token_splitter = ____(encoding_name=____, chunk_size=100, chunk_overlap=10)

# Split the PDF into chunks
chunks = ____

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk.page_content))}\n{chunk}\n")