1. Testing
We're at the final stage of development.
2. LLM lifecycle: Testing
Testing involves establishing robust processes to determine the readiness for the operational phase.
3. Why do we need to test?
Testing is crucial because LLMs make mistakes, especially as applications become more complex.
Changes to one aspect of the application can affect others, impacting performance. Consequently, testing is vital for assessing the application's readiness for deployment.
In this video, we'll specifically address evaluating the application's output.
4. Traditional ML versus LLM application testing
In traditional supervised machine learning, we use labeled training data and testing data to evaluate how well the model handles new, unseen data. We measure this using metrics that focus on accuracy or how close the predictions are to the target.
In contrast, for LLM applications, we usually only need test data to build and evaluate the application. Here, the focus is on evaluating the quality of the model's output on this test data.
5. Step 1: Building a test set
A comprehensive test set is essential to effectively evaluate our application. Building the test set is a continuous activity throughout development, but it should be complete by this stage.
The test data must closely resemble real-world scenarios to ensure accurate assessment. This can include either labeled text data for precise evaluation, or unlabeled text data, to simulate typical inputs.
Various tools, including other LLMs, can help generate test data to facilitate this process.
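As a rough illustration, a test set can be as simple as a list of example inputs with optional reference answers. The structure below is a minimal sketch, not a prescribed format; the example cases are made up.

```python
# A minimal sketch of a test set: each case has an input prompt and,
# where available, a reference answer for comparison-based metrics.
test_set = [
    {"input": "Summarize: The cat sat on the mat.", "reference": "A cat sat on a mat."},
    {"input": "Translate to French: Good morning.", "reference": "Bonjour."},
    {"input": "Write a short tagline for a travel app.", "reference": None},  # unlabeled case
]
```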
6. Step 2: Choosing our metric
The second step is choosing the right metric. This depends on the application, but a simple flowchart can guide the decision.
If the model's output has a correct answer, such as a target label or number, use machine learning metrics like accuracy to assess correctness.
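For example, if the application outputs a class label, standard classification metrics apply directly. A minimal sketch using scikit-learn's accuracy_score (the labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Target labels from the labeled test set and the labels the LLM application produced
y_true = ["positive", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "neutral", "neutral"]

# Fraction of predictions that exactly match the target label
print(accuracy_score(y_true, y_pred))  # 0.75
```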
7. Step 2: Choosing our metric
We can use text comparison metrics when there's no correct answer but we have a reference.
The aim is to mimic how humans evaluate similarity and quality in text. We have two options.
First, we can use statistical methods that compare the overlap between predicted and reference text.
Second, we have model-based methods, where pre-trained models assess similarity. A popular approach is using LLM-judges, LLMs designed to assess other LLMs, to gauge the similarity between the reference and predicted outputs.
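As a sketch of the statistical approach, the function below computes a simple unigram-overlap F1 score between a predicted and a reference text; dedicated metrics such as BLEU or ROUGE refine the same overlap idea.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens that appear in both texts (respecting multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```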
8. Step 2: Choosing our metric
If there's no reference answer but we have human feedback, we can use feedback score metrics. One approach is to have humans rate the text on quality, relevance, or coherence, although this can be expensive.
Alternatively, we can use model-based methods to predict expected ratings based on past feedback or ask LLM judges if feedback was incorporated.
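To make the LLM-judge idea concrete, the sketch below only builds a rating prompt; call_llm is a hypothetical stand-in for whatever client the application uses to query the judge model.

```python
def build_judge_prompt(question: str, answer: str) -> str:
    """Prompt asking an LLM judge to rate an answer on a 1-5 scale."""
    return (
        "You are evaluating the answer to a user question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer's quality, relevance, and coherence on a scale of 1 to 5. "
        "Reply with only the number."
    )

# Hypothetical usage:
# score = int(call_llm(build_judge_prompt(question, answer)))
```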
9. Step 2: Choosing our metric
Lastly, if there's no human feedback, we can use unsupervised metrics to assess text coherence, fluency, and diversity with statistical or model-based techniques.
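As one example of an unsupervised metric, distinct-n measures diversity as the share of unique n-grams in the generated text; the sketch below computes distinct-2.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Share of unique n-grams among all n-grams in the text (higher = more diverse)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

print(distinct_n("the cat sat on the mat the cat sat", n=2))  # 0.75
```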
10. Step 3: Define optional secondary metrics
In addition to primary metrics, it's beneficial to keep track of optional secondary metrics. These can relate to the text's characteristics, such as bias, toxicity and helpfulness.
Alternatively, they can pertain to operational characteristics of the application, like latency, total incurred cost, and memory usage.
This list is not exhaustive and is use-case dependent.
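Operational metrics can be tracked alongside the quality metrics. The sketch below times each request with Python's perf_counter; generate_answer is a hypothetical placeholder for the application's generation call.

```python
import time

def timed_call(prompt: str):
    """Return the model output together with the request latency in seconds."""
    start = time.perf_counter()
    output = generate_answer(prompt)  # hypothetical: the application's generation function
    latency = time.perf_counter() - start
    return output, latency
```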
11. The development cycle
The combination of the test set and evaluation metric determines when our LLM application is deployment-ready.
12. The development cycle
At this stage, tests can either pass or fail.
13. The development cycle
If a test fails, we need to revisit the development activities covered earlier.
14. The development cycle
If it passes, we're ready for deployment, marking the transition to the operational phase.
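In practice, the pass/fail decision can be a simple threshold on the primary metric, as in the sketch below; the 0.8 threshold is an arbitrary example value.

```python
# Minimal deployment gate: compare the primary metric to a chosen threshold.
PRIMARY_METRIC_THRESHOLD = 0.8  # example value; choose per use case

def is_deployment_ready(primary_metric: float) -> bool:
    return primary_metric >= PRIMARY_METRIC_THRESHOLD

if is_deployment_ready(0.85):
    print("Tests pass: move to the operational phase.")
else:
    print("Tests fail: revisit earlier development activities.")
```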
15. Let's practice!
Let's test this knowledge.