Advanced splitting methods
1. Advanced splitting methods
Now let's level up our document splitting!

2. Limitations of our current splitting strategies
Our current approach of splitting documents with character text splitters has a few limitations. These splits are effectively "naive" because they are executed without considering the context of the surrounding text. This means related information may be stored and processed separately, which will lower the quality of our RAG application. Splits are also made using characters rather than tokens. Recall that the language models we're using break text into tokens, or smaller units of text, for processing. If we split documents using characters rather than tokens, we risk retrieving chunks and creating a retrieval prompt that exceeds the maximum amount of text the model can process at once, also called the model's context window. We'll introduce methods to make our splitter more aware of the document's context and to enable splitting with tokens.

3. Splitting on tokens
When we use a character text splitter with a chunk_size and chunk_overlap, we get chunks containing groups of characters that satisfy the chunking parameters. When we split on tokens, chunk_size and chunk_overlap refer to the number of tokens in the chunk rather than the number of characters, so a chunk_size of five means a chunk can contain at most five tokens. Here, there are five tokens in the first chunk, each colored differently, four in the second chunk, and a two-token overlap.

The TokenTextSplitter can be used to perform token splitting. It requires the name of the encoding used by the large language model, which can be retrieved with the tiktoken.encoding_for_model() method and extracted with the .name attribute. Remember, chunk_size and chunk_overlap now represent tokens rather than characters. We'll use the .split_text() method to split the example_string and view the chunks.
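The mechanics can be sketched in plain Python. Here, a whitespace split stands in for a real tokenizer like tiktoken, and split_on_tokens is a hypothetical helper that mirrors the splitter's behavior, not the library API:

```python
# Sketch of token-based chunking: chunk_size and chunk_overlap count tokens.
# A whitespace split stands in for a real tokenizer such as tiktoken.

def split_on_tokens(text, chunk_size, chunk_overlap):
    tokens = text.split()  # stand-in for encoding.encode(text)
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        chunks.append(" ".join(chunk))  # stand-in for encoding.decode(chunk)
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the text
    return chunks

example_string = "Mary had a little lamb, it's fleece was white as snow."
chunks = split_on_tokens(example_string, chunk_size=5, chunk_overlap=2)
print(chunks)
```

Each chunk holds at most five "tokens", and the last two tokens of one chunk reappear at the start of the next, matching the chunk_size and chunk_overlap behavior described above.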
This looks good, but were we able to keep to the chunk_size of 10 tokens? Let's check!
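A quick sketch of that check; count_tokens uses a whitespace split as a hypothetical stand-in for len(encoding.encode(chunk)) with a real tiktoken encoding:

```python
# Check that every chunk respects the token budget.
def count_tokens(text):
    return len(text.split())  # stand-in for len(encoding.encode(text))

chunk_size = 10
chunks = ["Mary had a little lamb, it's fleece was white as",
          "white as snow."]
for chunk in chunks:
    print(count_tokens(chunk))
```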
We can loop through our chunks, encoding them into tokens with encoding.encode() and calculating the number of tokens with len(). The first chunk contains 10 tokens, the second chunk has six, and there is an overlap of two tokens ("fleece was"), so everything worked as expected! Let's now learn about a splitting strategy that splits in a more context-aware way: semantic splitting.

9. Semantic splitting
Take this block of text, for example. A character or token text splitter will split naively, which results in lost context. A semantic splitter will instead detect shifts in semantic meaning and perform the splits at those locations; in this example, where the discussion shifts from RAG to dogs.
To perform semantic splitting, we'll need an embedding model to generate text embeddings and detect shifts in topic. We'll use a model from OpenAI. We instantiate the semantic splitting class, passing the embedding model along with two additional parameters: breakpoint_threshold_type, which sets the metric by which embeddings are compared, and breakpoint_threshold_amount, which sets the threshold at which to perform the split.
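The idea can be sketched in plain Python. A bag-of-words vector is used here as a hypothetical stand-in for a real embedding model, and semantic_split is an illustrative helper, not the library class: it measures the cosine distance between neighboring sentences and splits wherever the distance exceeds the threshold.

```python
# Sketch of semantic splitting: embed consecutive sentences, measure the
# distance between neighbors, and split where the topic shifts.
import math

def embed(sentence):
    # Hypothetical stand-in embedding: a word-count dictionary.
    vec = {}
    for word in sentence.lower().split():
        word = word.strip(".,;:!?")
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine_distance(a, b):
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1 - dot / (norm_a * norm_b)

def semantic_split(sentences, breakpoint_threshold_amount=0.8):
    # Start a new chunk whenever the distance to the previous
    # sentence exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(sent)) > breakpoint_threshold_amount:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "RAG retrieves relevant documents to ground model responses.",
    "RAG reduces hallucinations by grounding responses in retrieved documents.",
    "Dogs are loyal companions that love long walks.",
]
print(semantic_split(sentences))
```

The two RAG sentences share enough vocabulary to stay in one chunk, while the jump to dogs produces a distance above the threshold, so the splitter breaks there.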
Like other splitters, we use the .split_documents() method to apply the splitter, in this case, to an academic paper. The semantic splitter reached the threshold of 0.8 and performed the splits; for the first chunk, splitting after the first two sentences of the abstract.

14. Let's practice!
Time to experiment with these new splitting methods!