
Validation

1. Validation

In this video we'll continue our journey to ensure that the AI systems we're building meet expected standards and provide good quality content to users.

2. Validation

With validation, we are testing the model's performance to uncover areas where the model might be prone to making mistakes.

3. Validation

Validation aims to uncover, under conditions that mimic real-world use, a variety of failure modes, such as: misinterpreting context, amplifying biases present in the training data, outputting outdated information, being manipulated into generating harmful or unethical content, or inadvertently revealing sensitive information.

4. Adversarial testing

One effective method for testing AI systems before they are released is adversarial testing. Adversarial testing involves providing the model with prompts specifically designed to expose its areas of weakness so that they can be addressed before release. The technique is also used with AI systems beyond large language models, where even a small change in input can produce an unwanted or wrong output.
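To make this concrete, here is a minimal sketch of an adversarial test loop in Python. The classify_sentiment function is a hypothetical stand-in for whatever model call your system uses, and the tricky reviews and expected labels are illustrative assumptions, not cases from the course.

```python
# Minimal adversarial test loop: feed deliberately tricky inputs to the model
# and report any case where its output differs from the expected label.
# `classify_sentiment` is a hypothetical wrapper around your model of choice.

def classify_sentiment(review: str) -> str:
    """Placeholder: call your model and return 'positive', 'negative', or 'neutral'."""
    raise NotImplementedError

# Adversarial cases chosen to probe known weak spots (mixed tone, sarcasm, negation).
adversarial_cases = [
    ("The first act was charming, but the film collapses into tedium.", "negative"),
    ("Oh great, another two hours of my life I'll never get back.", "negative"),
    ("It isn't bad at all; in fact, I'd happily watch it again.", "positive"),
]

failures = []
for review, expected in adversarial_cases:
    predicted = classify_sentiment(review)
    if predicted != expected:
        failures.append((review, expected, predicted))

print(f"{len(failures)} failing case(s) out of {len(adversarial_cases)}")
for review, expected, predicted in failures:
    print(f"- expected {expected}, got {predicted}: {review!r}")
```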

5. Adversarial testing

Let's have a look at an example. Here we have some movie reviews, and we are asking the model to extract the sentiment of each: positive, negative, or neutral. Here is a review that starts on a positive note but turns negative towards the end, which might make it more difficult for the model.
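As an illustration, a prompt for this task might look like the sketch below. The OpenAI client and the gpt-4o-mini model name are assumptions for illustration, not necessarily what the course uses, and the review text is a paraphrased example of a mixed-tone review rather than the one on the slide.

```python
# Sketch of prompting a chat model to label a single review.
# Assumes the `openai` Python package (v1+) and an API key in the environment;
# the model name and the review text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

review = (
    "The opening scenes are gorgeous and the lead is instantly likeable, "
    "but the plot soon unravels and the final hour is a chore to sit through."
)

prompt = (
    "Classify the sentiment of the following movie review as "
    "positive, negative, or neutral. Answer with a single word.\n\n"
    f"Review: {review}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)  # expected: "negative"
```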

6. Adversarial testing

The model recognizes this nuance and gives the correct answer, which is that the review is negative.

7. Adversarial testing

Here is another difficult example we have selected to test the model, this time a sarcastic film review: in this case the model fails to recognize the sarcasm and categorizes the review as neutral.
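When a failure like this is found, a common practice is to fold it back into the adversarial suite so that future model or prompt changes are checked against it. A minimal sketch, reusing the hypothetical adversarial_cases list from the earlier example; the sarcastic review here is an illustrative stand-in:

```python
# Record the discovered failure so it becomes a permanent regression test.
# The review text is an illustrative sarcastic example, not the one from the slide.
sarcastic_review = "What a masterpiece. I especially loved checking my watch every five minutes."
adversarial_cases.append((sarcastic_review, "negative"))
```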

8. Evaluation libraries and datasets

To evaluate models in a more structured way, researchers are developing libraries and datasets of standardized tasks that measure model performance across a variety of domains. These datasets cover a range of use cases and include metrics that can be used for the evaluation. They were introduced to address a shortcoming of earlier evaluations: a disconnect between benchmark scores and models' real-world capabilities. The benchmarks span domains including STEM, the humanities, and the social sciences, with difficulty levels ranging from elementary to professional, which has proven to be a more faithful representation of how models behave in the real world.
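As a hedged sketch of what this looks like in practice: the Hugging Face datasets and evaluate libraries can load a standardized benchmark such as MMLU and score a model's answers with a shared metric. The dataset identifier, the field names, and the answer_question function below are assumptions for illustration, not part of the course material.

```python
# Sketch: evaluating a model on a standardized benchmark with shared tooling.
# Assumes the `datasets` and `evaluate` packages; the MMLU subset name and the
# hypothetical `answer_question` function are illustrative assumptions.
from datasets import load_dataset
import evaluate

def answer_question(question: str, choices: list[str]) -> int:
    """Placeholder: return the index of the choice the model selects."""
    raise NotImplementedError

# Load one MMLU subject; each example has a question, a list of choices, and
# the index of the correct answer.
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

accuracy = evaluate.load("accuracy")
predictions = [answer_question(ex["question"], ex["choices"]) for ex in mmlu]
references = [ex["answer"] for ex in mmlu]

print(accuracy.compute(predictions=predictions, references=references))
```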

9. Let's practice!

We've explored some key techniques for evaluating and improving the robustness of AI models. As we continue to integrate AI into various systems affecting our daily lives, the importance of rigorous testing cannot be overstated. So, keep testing, and let's practice building robust systems!