How to measure success

1. How to measure success

Welcome back. You've created the RAG app. You've parsed and split all of your data so that the Cortex Search service can create the index and run queries against it. So now we have a working prototype. But how do we know if the service is giving us correct answers? How do we measure how successful a RAG is?

To begin our experimentation, the first thing we need to do is register this version of our app. This gives us a way to keep track of our tests and allows us to capture metadata like application name and version. We'll do so by using TruApp. We'll also need a TruLens connector for this. This is easy. We'll just import it and instantiate it with our Snowpark session. Then our tracing and evaluation will be written to the database and schema associated with our Snowpark session. These are the ones we created earlier. We'll instantiate TruApp with the app, an app name, and an app version. We'll also add the TruLens Snowflake connector.

When we're conducting experiments with a RAG, we want to test it against the same dataset in a batch. This workflow allows us to measure improvements to our app against a common dataset. We'll call this batch a run. To create our run, first we need a dataset. This will hold the queries we want to test. If we want, we could also include some ground truth here. To add the run, we'll create our run config. We'll include more metadata to describe the data and the run for the experiment. Then we'll add it to our registered version of the app, tru_rag. We can then describe it if we want. In the output here, you'll see all of the metadata we added in the run config, along with some other useful information, like which LLM judge we'll use for computing metrics. Lastly, we'll start the run. Starting the run means our application will run inference in batch on all of the queries in our dataset.

Now, how do we measure accuracy and correctness? What metrics should we use? The easiest way to measure accuracy is to have the answer key.
This is often referred to as the ground truth or golden set. If we have the correct answers, we can check the app's answers against the ground truth. The name for this metric is correctness. This metric is great, especially early on with a small dataset, but we'll skip it here.

When our dataset gets larger, we can add additional reference-free metrics. These are useful because we no longer need the ground truth data or answer key to check against. In particular for RAGs, we'll use the RAG triad of context relevance, groundedness, and answer relevance. If you think about the information architecture of a RAG, these metrics lie on each edge. Context relevance sits between the query and the retrieved context and evaluates the quality of our search. Groundedness measures how well the LLM response sticks to the facts in the retrieved context, and this is also sometimes called faithfulness. Answer relevance sits between the query and the response, ensuring that the final answer is relevant to what the user was looking for.

Now, how are these metrics actually computed? They rely on a relatively new technique called LLM as judge. LLM as judge operates just like it sounds. We pass in some text, such as a question and answer, along with specific instructions and criteria for grading. With this, the LLM can provide a score against those criteria. You might ask, why should we trust an LLM judge? Good question. The short answer is that we can benchmark the LLM judge against human evaluations. As for the long answer, I've included a link in the reading so you can learn all about it if you're interested.

Now, let's kick off our run evaluation using these metrics. To do this, we just need a single line of code, run.compute_metrics, and we'll pass the list of metrics we want to compute. Remember all of the metadata we captured with @instrument about the span type and attributes? This is how they're used. Each eval is pre-configured to run against particular span attributes.
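To make the idea concrete, here is a minimal sketch of how an LLM-as-judge evaluation for the RAG triad could be wired up. It is an illustration under stated assumptions, not the library's implementation: the field names, the grading wording, the 0 to 3 scale, and the helpers build_judge_prompt, score_record, and stub_judge are all hypothetical, and the judge is a stub rather than a real LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical mapping: which recorded fields each RAG-triad metric reads.
# These field names are illustrative, not the library's span attribute names.
METRIC_INPUTS: Dict[str, List[str]] = {
    "context_relevance": ["query", "retrieved_context"],
    "groundedness": ["retrieved_context", "response"],
    "answer_relevance": ["query", "response"],
}

# Hypothetical grading criteria, one instruction per metric.
CRITERIA: Dict[str, str] = {
    "context_relevance": "Rate how relevant the retrieved context is to the query.",
    "groundedness": "Rate how well the response is supported by the retrieved context.",
    "answer_relevance": "Rate how relevant the response is to the query.",
}


def build_judge_prompt(metric: str, record: Dict[str, str]) -> str:
    """Combine the grading instructions with only the fields this metric needs."""
    lines = [CRITERIA[metric], "Reply with a single integer from 0 to 3."]
    for field in METRIC_INPUTS[metric]:
        lines.append(f"{field.upper()}: {record[field]}")
    return "\n".join(lines)


def score_record(
    metric: str, record: Dict[str, str], judge: Callable[[str], str]
) -> float:
    """Ask the judge for a 0-3 grade and normalize it to the range 0.0-1.0."""
    grade = int(judge(build_judge_prompt(metric, record)).strip())
    if not 0 <= grade <= 3:
        raise ValueError(f"judge returned an out-of-range grade: {grade}")
    return grade / 3


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption); always answers '2'."""
    return "2"


# Placeholder record standing in for one traced query.
record = {
    "query": "example question",
    "retrieved_context": "example retrieved passage",
    "response": "example answer",
}

scores = {metric: score_record(metric, record, stub_judge) for metric in METRIC_INPUTS}
print(scores)
```

Because each metric declares up front which fields it reads, the caller only names the metrics. Swapping stub_judge for a function that calls a real model would leave the rest of the sketch unchanged.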
This means we don't have to specify which information each eval should use. Once we've kicked off the computation, let's go over to view the results.

To view the eval results, we want to go over to the left pane under AI and ML. We'll then choose Evaluations. From here, we want to open up the registered app name and app version. Now, we'll see a list of runs on this app version. To see results, we'll open up the run we just created so we can see each query in our dataset. We'll also see the evaluation results on the right side of the table. But wait, there's more. If we click on a particular trace or row, we can see even more information about this query. We can see all of the associated spans and the evaluation results.

Let's click on a low-performing row to learn more about what went wrong. We chose this query about what the Fed Funds rate was at a particular time in 2024. This query has low context relevance, meaning that our retriever did not find relevant context for the LLM to answer our question. It also had low answer relevance because the LLM declined to answer the question. This is a direct result of the LLM not having the context it needs. Data freshness is a hard problem and seems to be the issue here: the data is not fresh enough. What we need to do now is update our RAG to contain the new data, documents starting around late 2024.

That was a lot of information. In this video, you learned what metrics we can use to evaluate a RAG and a useful workflow for conducting experiments. In our example, this workflow enabled us to identify that our dataset was not fresh enough. Being aware of how these metrics work under the hood allows us to better understand how our RAGs are performing. With this, we will look at how to use this knowledge to improve our RAGs in the next video. Lots to do. See you in the next one.

2. Let's practice!
