1. Text splitting, embeddings, and vector storage
Welcome back!
2. Preparing data for retrieval
Previously, we learned how to load our documents for retrieval.
Next,
3. Preparing data for retrieval
we need to split documents into chunks that can be quickly retrieved and integrated into the model prompt. This requires
4. Preparing data for retrieval
embedding these chunks so they can be retrieved based on their semantic similarity,
5. Preparing data for retrieval
and storing them in a vector store. Let's cover these final three steps,
6. Preparing data for retrieval
starting with splitting, sometimes called chunking.
7. Splitting
Ideally, documents are split into chunks that contain sufficient context to be useful to the LLM.
However, larger doesn't always mean better.
If the chunks are huge, retrieval will be slow, and the LLM may struggle to extract the most relevant context from the chunk to respond to the query.
The chunk_size parameter is used to control this balance.
Another parameter, chunk_overlap, is used to capture important information that may be lost around the boundaries between chunks.
8. CharacterTextSplitter
Let's try our first splitting method, CharacterTextSplitter, on this text string.
First, instantiate a splitter using the class. We specify the separator to split on; in our case, it will split the text at each new paragraph, with a chunk_size of 100 and a chunk_overlap of 10 characters.
9. CharacterTextSplitter
To apply the splitter, call the .split_text() method on the text.
Let's view the chunks and their lengths, which we obtain using a list comprehension.
Keep in mind that CharacterTextSplitter frequently creates chunks that lack sufficient context to be useful in retrieval, like the first chunk here. By splitting only by paragraph, the method was also unable to keep every chunk below the chunk_size.
10. RecursiveCharacterTextSplitter
We can improve on this with RecursiveCharacterTextSplitter. It takes a list of separators and recursively splits using each one, attempting to create chunks below chunk_size.
For example, if splitting on the first separator produces chunks that exceed 100 characters, those chunks will be split further using the next separator.
11. RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter often preserves more context, which results in more coherent responses from our RAG application.
12. Splitting documents
Extending splitting from strings to documents requires just one change: swapping the .split_text() method for .split_documents().
13. Splitting documents
Each document has .page_content and .metadata attributes for extracting the respective information. Calculating the number of characters in each chunk, we can see that we were able to stay under the chunk_size.
14. Embedding and storage
Now that we've split our documents into chunks, let's embed and store them for retrieval.
15. What are embeddings?
Remember that embeddings are
16. What are embeddings?
numerical representations of text. Embedding models aim to capture the "meaning" of the text, and these numbers map the text's position in a high-dimensional, or vector, space.
17. What are embeddings?
Vector stores are databases specifically designed to store and retrieve this high-dimensional vector data.
18. What are embeddings?
When documents are embedded and stored, similar documents are located closer together in the vector space. When the RAG application receives a user input, it will be embedded and used to query the database, returning the most similar documents.
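To make "closer together" concrete, here is a toy illustration using hand-made three-dimensional vectors and cosine similarity (real embedding models produce hundreds or thousands of dimensions; the texts and numbers below are invented):

```python
# A toy illustration: hand-made "embeddings" ranked by cosine similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: two cooking-related texts and one finance text.
embeddings = {
    "How to bake bread": [0.9, 0.1, 0.2],
    "Sourdough starter tips": [0.8, 0.2, 0.3],
    "Quarterly earnings report": [0.1, 0.9, 0.7],
}

# Pretend embedding of the query "bread recipes".
query = [0.85, 0.15, 0.25]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    embeddings,
    key=lambda doc: cosine_similarity(query, embeddings[doc]),
    reverse=True,
)
print(ranked)
```

The two cooking texts score far higher than the finance text, which is exactly the behavior a vector store exploits when answering a query.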
19. Embedding and storing the chunks
We'll use an embedding model from OpenAI and store the vectors in a Chroma vector database.
Let's start by initializing the model. Then, to embed and store the chunks in one operation, we call the .from_documents() method on the Chroma class, passing the chunks and model.
Note that if we were embedding string chunks, we'd use the .from_texts() method instead.
20. Let's practice!
Time to practice splitting, embedding, and storing!