The evaluate library

1. The evaluate library

Well done fine-tuning! Before we wrap up, it's time to look at how to evaluate LLMs.

2. The evaluate library

We may be familiar with classical metrics, like accuracy, used to evaluate machine learning models. While these can also be applied to LLMs, language tasks usually require more comprehensive and task-specific metrics. Hugging Face's Evaluate library was created to address this need. It contains a collection of metrics to evaluate model performance against ground truth, including detailed descriptions of each metric accessible through code, along with tools to compare models, measure differences between them, and gain insights from language datasets. We can load a metric with the load() function by specifying the metric name, such as accuracy. The description attribute provides a helpful explanation of the metric.
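As a minimal sketch, loading a metric and reading its description might look like this (accuracy is just one example of a metric name):

```python
# Minimal sketch: load a metric by name and inspect its description
import evaluate

accuracy = evaluate.load("accuracy")

# The description attribute explains what the metric measures and how
print(accuracy.description)
```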

3. Features attribute

By accessing the features attribute of a metric, we can inspect the inputs required for its computation. Most metrics require two collections, predictions and references, containing the model outputs and the ground-truth labels, respectively. The supported data types are also specified: for instance, integers for class labels in metrics like accuracy and F1 score, or floats in metrics like Pearson correlation.
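For example, comparing the features of two metrics with different input types might look like this (the exact printed output depends on the library version):

```python
import evaluate

accuracy = evaluate.load("accuracy")
pearsonr = evaluate.load("pearsonr")

# Accuracy expects integer class labels for predictions and references
print(accuracy.features)

# Pearson correlation expects floating-point values instead
print(pearsonr.features)
```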

4. LLM tasks and metrics

Throughout the chapter, we'll explore the metrics for the five most common language tasks addressed by LLMs: text classification, text generation, summarization, translation, and question-answering.

5. LLM tasks and metrics

Let's start with classification.

6. Classification metrics

Here we're loading four classification metrics: accuracy, precision, F1, and recall, using the evaluate library. We can initialize a pipeline with a model and tokenizer and pass it our evaluation data. We can then transform the predicted labels into integers using a list comprehension.
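A sketch of these steps follows; the model and tokenizer objects and the evaluation_texts list are placeholders for your own fine-tuned model and data:

```python
import evaluate
from transformers import pipeline

# Load the four classification metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
f1 = evaluate.load("f1")
recall = evaluate.load("recall")

# model, tokenizer, and evaluation_texts stand in for your own objects
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
outputs = classifier(evaluation_texts)

# Map predicted string labels to integers (assuming "POSITIVE"/"NEGATIVE" labels)
predicted_labels = [1 if output["label"] == "POSITIVE" else 0 for output in outputs]
```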

7. Metric outputs

We use the compute method to evaluate five binary class predictions based on the ground-truth labels passed to the method as references. This example uses some pseudo labels. The results show that our predictions are correct 80% of the time, every positive prediction is correct, but there are some false negatives. This is acceptable in our case, but may be problematic in others.
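A small illustration with made-up labels that reproduces those numbers; the specific values here are assumptions for this sketch:

```python
# Pseudo labels: five binary predictions and their ground-truth references
predictions = [1, 0, 0, 0, 1]
references = [1, 1, 0, 0, 1]

print(accuracy.compute(predictions=predictions, references=references))   # {'accuracy': 0.8}
print(precision.compute(predictions=predictions, references=references))  # {'precision': 1.0}
print(recall.compute(predictions=predictions, references=references))     # {'recall': 0.67} -> one false negative
print(f1.compute(predictions=predictions, references=references))         # {'f1': 0.8}
```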

8. Evaluating our fine-tuned model

Let's evaluate our fine-tuned model by loading it with the tokenizer, tokenizing new data, and generating predicted labels. Using real labels (1 for positive reviews), we achieve a perfect score; but remember, in real-world scenarios, perfect scores can indicate potential issues.
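A sketch of this evaluation; the checkpoint path and review texts below are placeholders, not the course's actual files:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the fine-tuned checkpoint
model = AutoModelForSequenceClassification.from_pretrained("./fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")

# Placeholder reviews, all positive in this example
new_reviews = ["Absolutely loved this product!", "Great quality and fast delivery."]
inputs = tokenizer(new_reviews, return_tensors="pt", padding=True, truncation=True)

# Generate predicted labels from the model's logits
with torch.no_grad():
    logits = model(**inputs).logits
predicted_labels = torch.argmax(logits, dim=1).tolist()

# Real labels: 1 for positive reviews
real_labels = [1, 1]
print(accuracy.compute(predictions=predicted_labels, references=real_labels))
```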

9. Choosing the right metric

Choosing appropriate metrics is crucial for meaningful evaluation, and understanding the limitations of each metric is essential for making that decision. For instance, using accuracy in isolation can be misleading on imbalanced datasets. We should evaluate our use case with a combination of metrics and, in some cases, domain-specific success measures such as KPIs, for a more complete picture of performance.
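To illustrate the imbalance point with a toy example of my own (not from the slides): a classifier that always predicts the majority class scores high on accuracy while missing every positive case.

```python
# 95 negative and 5 positive examples; the "model" always predicts 0
references = [0] * 95 + [1] * 5
predictions = [0] * 100

print(accuracy.compute(predictions=predictions, references=references))  # {'accuracy': 0.95}
print(recall.compute(predictions=predictions, references=references))    # {'recall': 0.0}
```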

10. Let's practice!

For now, let's practice!
