1. Introduction to RAG evaluation
Welcome back!
2. Types of RAG evaluation
Because our RAG architecture is made up of several processes, there are a few places where performance can be measured.
We can evaluate the retrieval process to check if the retrieved documents are relevant to the query, the generation process to see if the LLM hallucinated or misinterpreted the prompt, or the final output to measure the performance of the whole system.
Let's start with the final output.
3. Output accuracy: string evaluation
We can use LLMs to measure the correctness of the final output by comparing it to a reference answer. We'll assign the query, the model's answer, and the reference answer it will be compared against to the following variables.
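As a minimal sketch, that setup might look like this; the strings are placeholders rather than the actual example used in this video.
```python
# Placeholder values standing in for the actual example in this video
query = "Which planet in the solar system has the most moons?"
predicted_answer = "Jupiter has the most moons of any planet in the solar system."
ref_answer = "Saturn has the most confirmed moons of any planet in the solar system."
```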
4. Output accuracy: string evaluation
To perform string evaluation, we need to define a prompt template and a large language model to use for evaluation. The prompt template instructs the model to compare the strings and evaluate the model output for correctness, returning either correct or incorrect. The model's temperature is also set to zero to minimize variability.
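Here's a rough sketch of that setup, assuming OpenAI chat models via langchain-openai; the model name and prompt wording are illustrative rather than the exact ones used here.
```python
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Illustrative evaluation prompt; the "qa" evaluator expects the
# input variables {query}, {result}, and {answer}
eval_prompt = PromptTemplate.from_template(
    """You are grading a submitted answer against a reference answer.
Question: {query}
Submitted answer: {result}
Reference answer: {answer}
Respond with CORRECT if the submitted answer matches the reference answer, and INCORRECT otherwise."""
)

# Temperature of zero to minimize variability in the judgments
eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
```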
5. Output accuracy: string evaluation
We initialize LangChainStringEvaluator from LangSmith, which is LangChain's platform for evaluating LLM applications. The evaluator takes "qa" as its first argument, which configures it to assess correctness, along with the LLM and prompt template to use.
We then call the .evaluate_strings() method on the model prediction, reference answer, and input query to perform the evaluation.
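Putting it together, a sketch of the evaluation might look like this, assuming recent versions of langsmith and langchain; here the wrapped LangChain evaluator is reached through the .evaluator attribute.
```python
from langsmith.evaluation import LangChainStringEvaluator

# "qa" selects the correctness evaluator; the LLM and prompt go in config
qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={"llm": eval_llm, "prompt": eval_prompt},
)

# Compare the prediction to the reference answer for the given input query
score = qa_evaluator.evaluator.evaluate_strings(
    prediction=predicted_answer,
    reference=ref_answer,
    input=query,
)
print(score)  # e.g. {'reasoning': ..., 'value': 'INCORRECT', 'score': 0}
```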
6. Output accuracy: string evaluation
A score of zero indicates that the predicted response was incorrect when compared to the reference answer, and we can see here that the model's response was deemed incorrect, which makes sense when we review it again.
7. Ragas framework
Now let's explore further with the RAGAS framework.
RAGAS was designed to evaluate both the retrieval and generation components of a RAG application. We will cover one metric for each component: faithfulness and context precision.
8. Faithfulness
Faithfulness assesses whether the generated output represents the retrieved documents well.
It is calculated using LLMs to assess the ratio of faithful claims that can be derived from the context to the total number of claims.
Because faithfulness is a proportion, it is normalized to between zero and one, where a higher score indicates greater faithfulness.
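Written as a formula, that ratio is:
```latex
\text{Faithfulness} = \frac{\text{number of claims in the response supported by the retrieved context}}{\text{total number of claims in the response}}
```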
9. Evaluating faithfulness
Ragas integrates nicely with LangChain, and the first step involves defining the models for the evaluator to use: one for generation and another for embeddings.
Next, we define an evaluation chain, passing it the faithfulness metric from ragas and the two models we defined.
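Here's a minimal sketch of that setup, assuming a recent version of ragas with its LangChain integration; the model names are illustrative.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness

# One model for generation and another for embeddings (names are illustrative)
llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Evaluation chain built around the Ragas faithfulness metric
faithfulness_chain = EvaluatorChain(
    metric=faithfulness,
    llm=llm,
    embeddings=embeddings,
)
```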
10. Evaluating faithfulness
To evaluate a model's response, we instantiate the chain, passing it a dictionary with "question", "answer", and "contexts" keys.
"question" is the query sent to the RAG application, "answer" is the response, and "contexts" are the document chunks available to the model.
A perfect faithfulness score of one indicates that the model's response could be fully inferred from the context provided.
11. Context precision
Context precision measures how relevant the retrieved documents are to the query.
A context precision score closer to one means the retrieved context is highly relevant.
The only change we need to make to the faithfulness evaluation chain is to import and use the context_precision metric instead.
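A sketch of that change, reusing the llm and embeddings defined earlier:
```python
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import context_precision

# Same pattern as the faithfulness chain, with the metric swapped out
context_precision_chain = EvaluatorChain(
    metric=context_precision,
    llm=llm,
    embeddings=embeddings,
)
```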
12. Evaluating context precision
The context_precision_chain similarly takes a dictionary with "question", "contexts", and "ground_truth" keys, representing the input query, the retrieved documents, and the ground truth document that should have been retrieved.
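As a sketch, with placeholder values for the three keys:
```python
# Placeholder inputs; "ground_truth" is the reference the retrieval should support
eval_result = context_precision_chain({
    "question": "Which planet in the solar system has the most moons?",
    "contexts": [
        "As of recent counts, Saturn has 146 confirmed moons, "
        "more than any other planet in the solar system.",
    ],
    "ground_truth": "Saturn has the most confirmed moons of any planet.",
})
print(eval_result)  # the score appears under 'context_precision'
```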
Printing the results, we can see that we achieved a high context precision, indicating that the retrieval process is returning highly relevant documents.
13. Let's practice!
Time to evaluate!