Splitting external data for retrieval
1. Splitting external data for retrieval
Now that we've loaded documents from different sources, let's learn how to parse the information.
2. RAG development steps
Document splitting breaks the loaded document into smaller parts, also called chunks. Chunking is particularly useful for long documents, which otherwise may not fit within an LLM's context window.
3. Thinking about splitting...
Let's examine the introduction of an academic paper saved as a PDF. One naive splitting option would be to separate the document line by line. This would be simple to implement, but because sentences often span multiple lines, and each line would be processed separately, key context could be lost.
4. Chunk overlap
To counteract lost context during splitting, a chunk overlap is often implemented. Here, we've selected two chunks, with the chunk overlap shown in green. Having this extra overlap present in both chunks helps retain context. If a model shows signs of losing context and misunderstanding information when answering from external sources, we may need to increase this chunk overlap.
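To see what overlap means in practice, here's a tiny plain-Python illustration (not a LangChain splitter) that slides a fixed-size window over some made-up text with a three-character overlap.

```python
# Illustration only: a naive sliding-window chunker over made-up text,
# not how the LangChain splitters work internally.
text = "context is often lost at chunk boundaries"
chunk_size, chunk_overlap = 24, 3

step = chunk_size - chunk_overlap
chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
print(chunks)
# The last 3 characters of each chunk reappear at the start of the next one,
# so text cut at a boundary keeps a little shared context in both chunks.
```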
5. What is the best document splitting strategy?
There isn't one document splitting strategy that works for all situations. We should experiment with multiple methods and see which one strikes the right balance between retaining context and managing chunk size. We will compare two document splitting methods: CharacterTextSplitter and RecursiveCharacterTextSplitter. Optimizing document splitting is an active area of research, so keep an eye out for new developments!
6. Example text for chunk size comparison
As an example, let's split this quote by Elbert Hubbard, which contains 103 characters, into chunks. We'll compare how the two methods perform on this quote with a chunk_size of 24 characters and a small chunk_overlap of three.
7. CharacterTextSplitter to split documents
Let's start with CharacterTextSplitter. This method splits on the separator first, then checks whether the resulting chunks satisfy the chunk_size and chunk_overlap settings. We call CharacterTextSplitter, passing the separator to split on, along with the chunk_size and chunk_overlap. Applying the splitter to the quote with the .split_text() method and printing the output, we can see a problem: each of these chunks contains more characters than our specified chunk_size. CharacterTextSplitter splits on the separator in an attempt to keep chunks smaller than chunk_size, but in this case, splitting on the separator alone couldn't produce chunks below that limit. Let's take a look at a more robust splitting method!
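Here's a minimal sketch of those calls, assuming a recent LangChain version (in older releases the class is imported from langchain.text_splitter). The quote string and the '.' separator below are placeholders rather than the exact values from the slide.

```python
from langchain_text_splitters import CharacterTextSplitter

# Placeholder text standing in for the 103-character quote on the slide.
quote = (
    "Placeholder first sentence standing in for the quote. "
    "Placeholder second sentence of roughly matching length."
)

ct_splitter = CharacterTextSplitter(
    separator=".",     # split on the separator first
    chunk_size=24,     # target maximum characters per chunk
    chunk_overlap=3    # characters shared between neighbouring chunks
)

chunks = ct_splitter.split_text(quote)
print(chunks)
print([len(chunk) for chunk in chunks])  # the chunks still exceed chunk_size
```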
8. RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter takes a list of separators to split on. It works through the list from left to right, splitting the document using each separator in turn and checking whether the resulting chunks can be combined while remaining under chunk_size. Let's split the quote using the same chunk_size and chunk_overlap.
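Here's a corresponding sketch, reusing the same placeholder quote and settings as above. The separator list shown is the splitter's usual default order (paragraphs, then newlines, then spaces, then individual characters); the exact list on the slide may differ.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Same placeholder quote as before.
quote = (
    "Placeholder first sentence standing in for the quote. "
    "Placeholder second sentence of roughly matching length."
)

rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried from left to right
    chunk_size=24,
    chunk_overlap=3
)

chunks = rc_splitter.split_text(quote)
print(chunks)
print([len(chunk) for chunk in chunks])  # lengths vary, all within chunk_size
```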
9. RecursiveCharacterTextSplitter
Notice how the length of each chunk varies. The class split by paragraphs first and found that the resulting chunks were still too big; likewise for sentences. It then reached the third separator, splitting words on spaces, and found that words could be combined into chunks while staying under the chunk_size character limit. Some of these chunks are too small to contain meaningful context, but this recursive approach may work better on larger documents.
10. RecursiveCharacterTextSplitter with HTML
We can also split other file formats, like HTML. Recall that we can load HTML using UnstructuredHTMLLoader. Defining the splitter is the same, but for splitting documents, we use the .split_documents() method instead of .split_text() to perform the split.
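Here's a rough sketch of that workflow. The file name "example.html" is hypothetical, the chunk settings are illustrative, and it assumes the langchain-community and unstructured packages are installed.

```python
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the HTML page into Document objects.
loader = UnstructuredHTMLLoader("example.html")
documents = loader.load()

html_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=300,    # illustrative values, larger than the quote example
    chunk_overlap=50
)

# .split_documents() works on Document objects and keeps their metadata,
# whereas .split_text() works on a plain string.
docs = html_splitter.split_documents(documents)
print(docs[0])
```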
11. Let's practice!
Time to practice!