1. Text splitting, embeddings, and vector storage
Welcome back!
2. Preparing data for retrieval
Previously, we learned how to load our documents for retrieval.
Next,
3. Preparing data for retrieval
we need to split documents into chunks that can be quickly retrieved and integrated into the model prompt. This requires
4. Preparing data for retrieval
embedding these chunks so they can be retrieved based on their semantic similarity,
5. Preparing data for retrieval
and storing them in a vector store. Let's cover these final three steps,
6. Preparing data for retrieval
starting with splitting, sometimes called chunking.
7. Splitting
Ideally, documents are split into chunks that contain sufficient context to be useful to the LLM.
However, larger doesn't always mean better.
If the chunks are huge, retrieval will be slow, and the LLM may struggle to extract the most relevant context from the chunk to respond to the query.
The chunk_size parameter is used to control this balance.
Another parameter, chunk_overlap, is used to capture important information that may be lost around the boundaries between chunks.
8. CharacterTextSplitter
Let's try our first splitting method, CharacterTextSplitter, on this text string.
First, instantiate a splitter using the class. We specify the separator to split on; in our case, it will split the text at each new paragraph, with a chunk_size of 100 and a chunk_overlap of 10 characters.
9. CharacterTextSplitter
To apply the splitter, call the .split_text() method on the text.
Let's view the chunks and their lengths, which we obtain using a list comprehension.
Keep in mind that CharacterTextSplitter frequently creates chunks that lack sufficient context to be useful in retrieval, like the first chunk here. By splitting only by paragraph, the method was also unable to keep every chunk below the chunk_size.
10. RecursiveCharacterTextSplitter
We can improve on this with RecursiveCharacterTextSplitter. It takes a list of separators and recursively splits using each one, attempting to create chunks below chunk_size.
For example, if splitting on the first separator produces chunks that exceed 100 characters, those chunks will be split further using the next separator.
11. RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter often preserves more context, which results in more coherent responses from our RAG application.
12. Splitting documents
Extending splitting from strings to documents requires just one change: swapping the .split_text() method for .split_documents().
13. Splitting documents
Each document has .page_content and .metadata attributes for extracting the respective information. Calculating the number of characters in each chunk, we can see that we were able to stay under the chunk_size.
14. Embedding and storage
Now that we've split our documents into chunks, let's embed and store them for retrieval.
15. What are embeddings?
Remember that embeddings are
16. What are embeddings?
numerical representations of text. Embedding models aim to capture the "meaning" of the text, and these numbers map the text's position in a high-dimensional, or vector, space.
17. What are embeddings?
Vector stores are databases specifically designed to store and retrieve this high-dimensional vector data.
18. What are embeddings?
When documents are embedded and stored, similar documents are located closer together in the vector space. When the RAG application receives a user input, it will be embedded and used to query the database, returning the most similar documents.
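To make "closer together" concrete, here is a toy illustration using hand-made three-dimensional vectors and cosine similarity (real embedding models produce hundreds or thousands of dimensions; the texts and numbers below are invented):

```python
# A toy illustration: hand-made "embeddings" ranked by cosine similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: two cooking-related texts and one finance text.
embeddings = {
    "How to bake bread": [0.9, 0.1, 0.2],
    "Sourdough starter tips": [0.8, 0.2, 0.3],
    "Quarterly earnings report": [0.1, 0.9, 0.7],
}

# Pretend embedding of the query "bread recipes".
query = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    embeddings,
    key=lambda doc: cosine_similarity(query, embeddings[doc]),
    reverse=True,
)
print(ranked)
```

The two cooking texts score far higher than the finance text, which is exactly the behavior a vector store exploits when answering a query.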
19. Embedding and storing the chunks
We'll use an embedding model from OpenAI and store the vectors in a Chroma vector database.
Let's start by initializing the model. Then, to embed and store the chunks in one operation, we call the .from_documents() method on the Chroma class, passing the chunks and model.
Note that if we were embedding string chunks, we'd use the .from_texts() method instead.
20. Let's practice!
Time to practice splitting, embedding, and storing!