Evaluating Graph RAG with RAGAS
1. Evaluating Graph RAG with RAGAS
So now you have your Graph RAG system in place and you are ready to go into production, right? Not so fast!
2. Summary so far
For both the text-to-Cypher chain and vector or hybrid approaches, there are lots of components and processes that can underperform. It's important to adopt a robust evaluation framework that considers the entire workflow to mitigate as much risk as possible.
3. Evaluating Graph RAG
As we evaluate Graph RAG against other methods, there will always be a trade-off between cost, time or latency, and output quality. We can use Python's time module to compare runtimes for different Graph RAG configurations. Cost is also a key consideration before putting your RAG application into production. We can use the tiktoken library to count the tokens in the retrieved context and in the model's generated output, and estimate the associated costs. The Ragas library provides a robust RAG evaluation framework to assess the performance of the full pipeline. We will focus on output quality here, but try experimenting with the time module and tiktoken independently, as sketched below. Ragas provides a number of metrics, but two that are especially relevant to Graph RAG are context precision and noise sensitivity.
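As a rough sketch, latency and token costs can be measured like this. The stand-in pipeline function, the encoding name, and the per-token prices below are illustrative placeholders, not values from this course:

```python
import time
import tiktoken

# Stand-ins for a real question, retrieved context, and Graph RAG chain.
question = "Who says 'To be, or not to be'?"
retrieved_context = '{"c.name": "Hamlet", "l.text": "To be, or not to be..."}'

def run_pipeline(query: str) -> str:
    # Placeholder for your chain's invoke() call.
    return "The line is spoken by Hamlet."

# Time a single retrieval-plus-generation round trip.
start = time.time()
answer = run_pipeline(question)
latency = time.time() - start

# Count tokens in the context and the output; pick the encoding for your model.
enc = tiktoken.get_encoding("cl100k_base")
context_tokens = len(enc.encode(retrieved_context))
output_tokens = len(enc.encode(answer))

# Illustrative prices in USD per million tokens - check your provider's rates.
cost = context_tokens / 1e6 * 0.15 + output_tokens / 1e6 * 0.60
print(f"latency={latency:.2f}s, context={context_tokens} tokens, "
      f"output={output_tokens} tokens, cost~${cost:.6f}")
```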
4. Noise Sensitivity
Noise sensitivity measures the amount of irrelevant information, or noise, in the retrieved documents. We expect this to be higher with plain vector search, as the retrieval is done by semantic similarity rather than relationships. We can import the NoiseSensitivity metric from ragas.metrics and specify an LLM to perform the evaluation. The NoiseSensitivity metric has two modes: relevant and irrelevant. With the irrelevant mode, the higher the value, the more irrelevant information is contained in the retrieved documents.
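A minimal sketch of defining the metric follows; note that the name of the mode parameter has varied across ragas releases (`mode` in recent versions, `focus` in some older ones), so check the version you have installed:

```python
from ragas.metrics import NoiseSensitivity

# "irrelevant" mode: higher scores mean more irrelevant material made it
# into the retrieved documents. (Some older ragas releases call this
# parameter `focus` instead of `mode`.)
noise_sensitivity = NoiseSensitivity(mode="irrelevant")

# An evaluator LLM can be attached here via the `llm` argument, or passed
# once to evaluate() later on - we'll set one up shortly.
```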
5. Context Precision
The relevance of the retrieved documents can be measured with the LLM-based context precision metric. This metric compares the user input with the retrieved documents, using an LLM to determine their relevance. If we have a reference that we can compare the retrieved documents to, we use the LLMContextPrecisionWithReference metric from ragas. If we don't, we use the LLMContextPrecisionWithoutReference class. This is how you define ragas metrics, but to actually apply them, we need an evaluation dataset.
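A short sketch of defining both variants; as with noise sensitivity, the evaluator LLM can be attached per metric or supplied later to evaluate():

```python
from ragas.metrics import (
    LLMContextPrecisionWithReference,
    LLMContextPrecisionWithoutReference,
)

# Reference-based variant: judges retrieved documents against a ground truth.
context_precision_with_ref = LLMContextPrecisionWithReference()

# Reference-free variant: judges retrieved documents against the user input alone.
context_precision = LLMContextPrecisionWithoutReference()
```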
7. Text-to-Cypher result structure
Ragas requires an evaluation dataset consisting of the user input, the response, and the retrieved context used to generate the response. The retrieved contexts must be a list of strings, so for typical text-to-Cypher results, which come back as dictionaries, we need to convert each dictionary to a string with json.dumps().
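For example, assuming the text-to-Cypher step returns a list of dictionaries (the keys and values here are illustrative):

```python
import json

# Illustrative text-to-Cypher result: one dictionary per returned row.
cypher_results = [
    {"c.name": "Hamlet", "l.text": "To be, or not to be..."},
]

# Ragas expects retrieved contexts as a list of strings, so serialize each row.
retrieved_contexts = [json.dumps(row) for row in cypher_results]
```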
9. Vector-only result structure
For a plain vector search, the retrieved context may only contain the lines spoken by the character, so they may already be in string form.
10. Hybrid result structure
For a hybrid search, retrieved texts can be stored under a "page_content" key, and any metadata can be captured under a metadata dictionary. This metadata can include node properties and relationships to other nodes. As with the text-to-Cypher data, we need json.dumps() to convert the nested dictionary into a string.
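A sketch of flattening one hybrid result, with made-up property names in the metadata:

```python
import json

# Illustrative hybrid result: text under "page_content", node properties
# and relationships under "metadata".
hybrid_result = {
    "page_content": "To be, or not to be...",
    "metadata": {"character": "Hamlet", "play": "Hamlet"},
}

# Flatten the nested dictionary into a single string for ragas.
retrieved_contexts = [json.dumps(hybrid_result)]
```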
11. Creating an evaluation dataset
Ragas provides EvaluationDataset specifically for storing and working with evaluation datasets. We can create one from our dictionary by wrapping it in a list and calling the .from_list() method.
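For instance, reusing the stringified contexts from the sketches above (the question, response, and reference values are illustrative; the reference is only needed by reference-based metrics such as noise sensitivity):

```python
from ragas import EvaluationDataset

record = {
    "user_input": "Who says 'To be, or not to be'?",
    "response": "The line is spoken by Hamlet.",
    "retrieved_contexts": retrieved_contexts,
    "reference": "Hamlet speaks the line.",  # ground truth for reference-based metrics
}

eval_dataset = EvaluationDataset.from_list([record])
```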
12. Choosing an LLM for evaluation
We'll need to choose an LLM to run our evaluations with - you can use any LangChain-compatible LLM here. We use temperature=0 to maximize consistency in the evaluations - recall that a lower temperature means less creativity and more consistency. Finally, we wrap it in the ragas LangchainLLMWrapper class.
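For example, with an OpenAI chat model via langchain_openai (any LangChain-compatible LLM would work; the model name here is just an example):

```python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# temperature=0 keeps the evaluator's judgments as consistent as possible.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator_llm = LangchainLLMWrapper(llm)
```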
13. Evaluating responses
Let's now calculate the noise sensitivity and context precision metrics we defined earlier on our evaluation dataset. We can call the evaluate() function, specifying the dataset and the list of metrics we defined earlier. The output will be a dictionary of values corresponding to the metrics we selected. A high context precision and low noise sensitivity indicate that the content being retrieved is highly relevant to the question being asked.
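Putting the pieces from the sketches above together, the evaluation call might look like this:

```python
from ragas import evaluate

results = evaluate(
    dataset=eval_dataset,
    metrics=[noise_sensitivity, context_precision],
    llm=evaluator_llm,  # used by any metric that wasn't given its own LLM
)
print(results)  # a dict-like result keyed by metric name
```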
14. Let's practice!
Now it's your turn!