Recursively splitting by character
Many developers use a recursive character splitter to split documents on an ordered list of separators. By default, these are paragraph breaks, newlines, spaces, and the empty string: ["\n\n", "\n", " ", ""].
Effectively, the splitter first tries to split by paragraphs and checks whether the resulting chunks satisfy the chunk_size and chunk_overlap settings. If they do not, it falls back to splitting by lines (single newlines), then words, and finally individual characters.
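The fallback order described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the helper names `merge_pieces` and `recursive_split` are hypothetical, the sketch ignores chunk_overlap entirely, and it drops the separator where it starts a new chunk. It uses the quote and separators from the exercise further down.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join small pieces with `sep` while the result fits in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, separators, chunk_size):
    """Split on the first separator; retry oversized pieces with finer ones."""
    # Base case: the text already fits in a single chunk.
    if len(text) <= chunk_size:
        return [text] if text else []
    # The empty-string separator (or no separators left) means "cut anywhere",
    # so fall back to fixed-width slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], []
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        if len(piece) <= chunk_size:
            buffer.append(piece)
        else:
            # Flush the small pieces collected so far, then recurse on the big one.
            chunks.extend(merge_pieces(buffer, sep, chunk_size))
            buffer = []
            chunks.extend(recursive_split(piece, finer, chunk_size))
    chunks.extend(merge_pieces(buffer, sep, chunk_size))
    return chunks


quote = (
    "Words are flowing out like endless rain into a paper cup,\n"
    "they slither while they pass,\n"
    "they slip away across the universe."
)
chunks = recursive_split(quote, ["\n", " ", ""], chunk_size=24)
print(chunks)
print([len(chunk) for chunk in chunks])
```

Every line of the quote is longer than 24 characters, so splitting on "\n" alone is not enough and each line is split again on spaces. The real splitter additionally carries up to chunk_overlap characters from one chunk into the next, so its output will differ from this sketch.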
Often, you'll need to experiment with different chunk_size and chunk_overlap values to find the ones that work well for your documents.
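To get a feel for how the two values interact, here is a minimal sketch that uses a plain sliding character window. It is an assumption-laden stand-in, not the recursive splitter itself: the hypothetical helper `window_chunks` cuts at fixed offsets and ignores separators, so it only shows how a larger overlap produces more, more redundant chunks.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width chunks where each chunk repeats the last chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window always contributes new characters.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "abcdefghij"
# Try a few (chunk_size, chunk_overlap) pairs and compare the results.
for size, overlap in [(4, 0), (4, 2), (6, 2)]:
    result = window_chunks(sample, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(result)} chunks -> {result}")
```

With the real splitter, the same kind of loop applies: vary the two settings, inspect the chunks and their lengths, and keep the combination that preserves enough context for your documents.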
This exercise is part of the course
Developing LLM Applications with LangChain
Exercise instructions
- Import the appropriate LangChain class for splitting a document recursively by character.
- Define a recursive character splitter to split on the characters "\n", " ", and "" (in that order) with a chunk_size of 24 and chunk_overlap of 10.
- Split quote, and print the chunks and chunk lengths.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the recursive character splitter
from langchain_text_splitters import ____
quote = 'Words are flowing out like endless rain into a paper cup,\nthey slither while they pass,\nthey slip away across the universe.'
chunk_size = 24
chunk_overlap = 10
# Create an instance of the splitter class
splitter = ____
# Split the document and print the chunks
docs = ____
print(docs)
print([len(doc) for doc in docs])