Recursively splitting documents
Splitting on a single character is simple and predictable, but it often produces sub-optimal chunks. In this exercise, you'll apply recursive character splitting to split the Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks paper you loaded in a earlier exercise.
Recall that recursive character splitting iterates over a list of characters, splitting on each in turn to see if chunks can be created beneath the chunk_size
limit.
This exercise is part of the course
Retrieval Augmented Generation (RAG) with LangChain
Exercise instructions
- Define a LangChain recursive character text splitter to split recursively through the character list
['\n', '.', ' ', '']
with a chunk size of75
and chunk overlap of10
. - Split
document
using thetext_splitter
you defined and an appropriate method.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
loader = PyPDFLoader("rag_paper.pdf")
document = loader.load()
# Define a text splitter that splits recursively through the character list
text_splitter = ____(
____,
chunk_size=75,
chunk_overlap=10
)
# Split the document using text_splitter
chunks = text_splitter.____
print(chunks)
print([len(chunk.page_content) for chunk in chunks])