
Model evaluation

1. Model evaluation

Welcome back. Let's dive into that last piece of generative AI development: model evaluation.

2. Why evaluate anyway?

Evaluation, in the context of generative AI models, is the process of assessing the performance and effectiveness of a model on a set of defined parameters or tasks. If generative AI models are producing useful content, why bother to evaluate them at all? Quality evaluation serves several key purposes. First, evaluation measures progress as we train models for longer or change their design. Second, evaluation lets us compare models rigorously, so we can determine which models work best for which tasks. Finally, evaluation can benchmark generative AI against human performance. As AI capabilities evolve, understanding their strengths helps us determine how best to leverage both types of intelligence.

3. Evaluating generative AIs

So what are some effective ways to evaluate generative AI quality? Quantitative methods include discriminative model evaluation metrics and generative model-specific metrics. More human-centered methods include comparison with human performance and evaluation by humans or other intelligent agents. Each method offers a unique perspective on the quality of a model's outputs, but each also has limitations that make it useful only in specific contexts.

4. Discriminative model evaluation techniques

Discriminative model evaluation techniques, such as accuracy, focus on well-defined tasks with clear metrics. We can also use these metrics when generative AI outputs can be clearly categorized, measuring quality as we would darts on a dartboard: closer to the bullseye is better. These techniques are widely used and easy to calculate and compare. However, generative AI content often needs to be evaluated on multiple subjective criteria. For example, how does one measure the beauty of a painting? This makes it challenging to assess such outputs using standard metrics.
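
When outputs can be cleanly categorized, the basic calculation is simple. Here is a minimal sketch in Python (not from the course; the sentiment labels are hypothetical) of computing accuracy as the fraction of exact matches:

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical example: a model asked to label the sentiment of generated reviews
predictions = ["positive", "negative", "positive", "neutral"]
references = ["positive", "negative", "negative", "neutral"]
print(f"Accuracy: {accuracy(predictions, references):.2f}")  # Accuracy: 0.75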

5. Generative model-specific metrics

Despite this challenge, automated scores for generative AI do exist. They are custom-built for specific tasks and quantify nuanced criteria such as realism, diversity, and novelty. Several of these metrics are well known, allowing for comparison. However, they still fail to capture many subjective qualities of generated content, and they often do not generalize across models and tasks.
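
As one concrete illustration, here is a minimal sketch of an automated score of this kind. Distinct-n is my own choice of example rather than a metric named in the course, and the sample texts are hypothetical; it measures diversity as the ratio of unique n-grams across generated samples:

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across generated texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical generated samples
samples = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "a dog ran across the park",
]
print(f"distinct-2: {distinct_n(samples):.2f}")  # higher values suggest more diverse output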

6. Human performance comparison

By comparing AI performance on standardized tests to human scores, we gain insights into a model's ability to interpret information and apply knowledge. This shows how the AI compares to human abilities and where it may be practically applicable. On the other hand, such comparisons can be unfair, as AI and humans have different strengths.
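
For a sense of how such a comparison might be computed, here is a minimal sketch using hypothetical numbers: the AI's test result is placed on a distribution of human scores and reported as a percentile.

# Hypothetical exam scores; real comparisons use published human score distributions
human_scores = [55, 60, 62, 68, 70, 73, 75, 80, 85, 92]
ai_score = 78

percentile = 100 * sum(s <= ai_score for s in human_scores) / len(human_scores)
print(f"The AI scores at roughly the {percentile:.0f}th percentile of human test-takers.")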

7. Award-winning AIs

AI is now able to compete in a variety of human competitions, and even win. For example, this artwork generated by Stable Diffusion won a State Fair art competition. Generative AIs are also increasingly able to outperform most humans on difficult standardized tests, as we can see from the performance of the generative language model GPT-4 at its launch: it surpassed the average human on a variety of tests designed for humans!

8. The gold standard

Since generative AIs produce content that is ultimately used by humans or other AIs, evaluation by such intelligences is the gold standard. This method can capture the subjective aspects of quality, as it evaluates outputs in the context of user needs. But acquiring human evaluations is slow, often costly, and difficult to standardize. Moreover, it introduces human bias and inconsistency into the evaluation process.
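
As a sketch of how human judgments are often summarized in practice (my own example, with made-up votes and placeholder model names), judges compare two models' outputs side by side, and the results are aggregated into win rates:

from collections import Counter

# Hypothetical judge votes from side-by-side comparisons of two models
votes = ["model_a", "model_a", "model_b", "tie", "model_a", "model_b"]

counts = Counter(votes)
total = len(votes)
for model in ("model_a", "model_b"):
    # A common convention: a tie counts as half a win for each model
    win_rate = (counts[model] + 0.5 * counts["tie"]) / total
    print(f"{model} win rate: {win_rate:.2f}")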

9. Turing's classic test

A well-known example of human evaluation is the Turing Test. Proposed by computer scientist Alan Turing, it is a classic way to evaluate AI-generated content. In its original formulation, a human evaluator has a text-based, back-and-forth conversation with both a human and an AI. If the evaluator cannot reliably distinguish the AI's responses from the human's, the AI passes the test. However, the test has been criticized, since human behavior can be unintelligent while intelligent behavior can appear inhuman.

10. Let's practice!

Let's put these ideas into action.