
Optimizing document retrieval

1. Optimizing document retrieval

Now that we've upgraded our document loading and splitting, let's see how the document retrieval process can be optimized.

2. Putting the R in RAG...

So far, our document retrieval has consisted of a vector database containing embedded documents. The input to the RAG application is then used to query the vectors, using a distance metric to determine which vectors are closest and therefore most similar and relevant. This type of retrieval is known as

3. Retrieval methods

dense retrieval. Dense retrieval methods encode the entire chunk as a single vector that is said to be "dense", that is, most of its component values are non-zero. Dense retrieval excels at capturing semantic meaning, but the embeddings can be computationally intensive to create and query, and may struggle with capturing rare words or highly specific technical terms.

4. Retrieval methods

There's also sparse retrieval, which is a method of finding information by matching specific keywords or terms in a query with those in documents. The resulting vectors contain many zeros, with only a few non-zero terms, which is why they are said to be "sparse". Sparse retrieval allows for precise matching on exact words; the resulting vectors are also more explainable due to their alignment with specific terms, and rare words are better represented in the embeddings. The trade-off here is that sparse retrieval methods are less generalizable, as they aren't extracting the semantic meaning from the text.

5. Sparse retrieval methods

TF-IDF and BM25 are two popular methods for encoding text as sparse vectors. TF-IDF, or Term Frequency-Inverse Document Frequency, creates a sparse vector that measures a term's frequency in a document and its rarity across other documents. This helps in identifying words that best represent the document's unique content. BM25, or Best Matching 25, is an improvement on TF-IDF that prevents high-frequency words from being over-emphasized in the encoding. Let's try out BM25 for RAG retrieval.
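To make the scoring concrete, here's a minimal from-scratch sketch of BM25 over a toy corpus. The documents and parameter values (k1, b) are illustrative choices, not part of the course material; the formula combines the inverse document frequency of each query term with a saturating, length-normalized term frequency.

```python
import math
from collections import Counter

# Toy corpus: each "document" is a tokenized chunk (hypothetical examples)
docs = [
    "python was created by guido van rossum".split(),
    "python is a popular language for data science".split(),
    "the snake python is found in africa and asia".split(),
]

N = len(docs)
avgdl = sum(len(d) for d in docs) / N           # average document length
df = Counter(t for d in docs for t in set(d))   # document frequency per term

def bm25_score(query, doc, k1=1.5, b=0.75):
    """BM25 score of one tokenized document for a tokenized query."""
    freqs = Counter(doc)
    score = 0.0
    for term in query:
        if term not in freqs:
            continue
        # Rarer terms across the corpus get a higher weight
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        tf = freqs[term]
        # Saturating term frequency, normalized by document length:
        # this is what stops high-frequency words dominating the score
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "when was python created".split()
best = max(docs, key=lambda d: bm25_score(query, d))
print(" ".join(best))  # the statement about Python's creation scores highest
```

Note how "python", which appears in every document, contributes almost nothing to the score, while "was" and "created" pick out the first document.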

6. BM25 retrieval

The BM25Retriever class can be used to create a retriever from documents or text, just like the retrievers we have already used. Let's start with a small example of three statements about Python. We can use the .from_texts() method to create the retriever from these strings. The k value sets the number of items returned by the retriever when invoked.

7. BM25 retrieval

Invoking the retriever with an input and returning the page content of the most relevant result according to BM25 reveals that the statement about when Python was created was correctly returned. Looking at all three statements again, we can see that BM25 returned the statement sharing terms with the input that don't appear in the other statements. Now that we've tested the BM25 retriever, let's integrate it into RAG.

8. BM25 in RAG

We'll create a RAG system to integrate a DataCamp blog post on RAG with an LLM. The first step is the same as before, but using the .from_documents() method as we're dealing with document chunks and not strings this time. Then, we use the same LCEL syntax as a standard dense retrieval RAG to integrate the retriever with a prompt template and LLM. Remember that RunnablePassthrough allows us to insert the input unchanged into the chain.

9. BM25 in RAG

Finally, we can invoke the chain on the input and return the results.

10. Let's practice!

Now, let's practice!
