
Metrics for language tasks: ROUGE, METEOR, EM

1. Metrics for language tasks: ROUGE, METEOR, EM

Excellent evaluation. Let's keep going.

2. LLM tasks and metrics

The ROUGE score is the most popular metric for evaluating text summarization, and BLEU is also useful for this task.

3. LLM tasks and metrics

Both BLEU and METEOR are suitable metrics for evaluating translations; we've already seen BLEU in action here.

4. LLM tasks and metrics

And question-answering normally uses a combination of Exact Match and F1 scores for extractive QA, whereas BLEU and ROUGE scores are preferred in generative QA.

5. ROUGE

Let's examine the ROUGE score, which measures similarity between model-generated and reference summaries by analyzing n-gram co-occurrences and word overlap. Here, we have two similar sentences showing a bigram overlap, where n is two.
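To make the idea of n-gram co-occurrence concrete, here is a minimal sketch of bigram overlap (not the official ROUGE implementation), using two illustrative sentences of our own rather than the ones on the slide:

```python
# Minimal illustration of bigram (n = 2) overlap; not the official ROUGE implementation.
def bigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 2]) for i in range(len(words) - 1)}

generated = "the cat sat on the mat"
reference = "a cat sat on the rug"

shared = bigrams(generated) & bigrams(reference)
print(shared)  # shared bigrams: ('cat', 'sat'), ('sat', 'on'), ('on', 'the')
```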

6. ROUGE

We'll evaluate a summary about exoplanets. ROUGE will provide us with a set of metric scores capturing different aspects of text similarity, such as unigram (one-word) and bigram (two-word) overlap, longer common subsequences, and more.
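In practice, ROUGE can be computed with Hugging Face's evaluate library. The snippet below is a sketch: the prediction and reference strings are made-up stand-ins for the exoplanet texts shown on the slide.

```python
import evaluate

rouge = evaluate.load("rouge")

# Hypothetical stand-ins for the exoplanet summary and its reference.
predictions = ["Exoplanets are planets that orbit stars outside our solar system."]
references = ["An exoplanet is a planet that orbits a star beyond our solar system."]

results = rouge.compute(predictions=predictions, references=references)
print(results)  # dict with rouge1, rouge2, rougeL, and rougeLsum scores
```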

7. ROUGE outputs

Here are the ROUGE scores evaluating our exoplanet text. Each score falls between 0 and 1, with a higher score indicating higher similarity. rougeLsum is a variant of rougeL tailored to summarization tasks.

8. METEOR

METEOR incorporates more linguistic features into evaluation, such as variations in words through stemming, capturing words with similar meanings, and penalizing errors in word order. Let's compare it to BLEU in a translation scenario. Our prediction variable contains an LLM output: an English translation of a passage from the famous Spanish novel "Don Quijote". Our reference variable contains a reference translation of the same passage.
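Here is a sketch of how this comparison might be run with the evaluate library; the strings below are illustrative stand-ins, not the actual passage or its reference translation.

```python
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# Illustrative stand-ins; the real Don Quijote passage is not reproduced here.
predictions = ["In a village of La Mancha whose name I do not wish to recall"]
references = ["Somewhere in La Mancha in a place whose name I do not care to remember"]

bleu_results = bleu.compute(predictions=predictions, references=references)
meteor_results = meteor.compute(predictions=predictions, references=references)

print(bleu_results["bleu"])      # rewards exact n-gram overlap only
print(meteor_results["meteor"])  # also rewards stems and synonyms
```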

9. METEOR

Computing BLEU and METEOR scores, we see that BLEU gives a comparatively lower score, suggesting lower surface-level similarity between the texts. The METEOR score, also between 0 and 1, indicates there is still semantic alignment in the translation.

10. Question and answering

Lastly, on to question-answering. This task normally uses a combination of Exact Match and F1 scores for extractive QA, whereas BLEU and ROUGE scores are preferred in generative QA. Recall that extractive QA answers a question with an extract or label, while generative QA generates a full textual answer.

11. Exact Match (EM)

Let's focus on extractive QA to introduce exact match, or EM. Given a list of answers extracted by our model and their associated reference answers, EM returns 1 when the model output exactly matches its reference answer, and 0 otherwise. Here is an example of its use on three answers collected from an extractive QA model. Only the second answer, "Theaters are great", fully matches its reference, resulting in a score of one third (0.33 recurring), since one of the three answers is an exact match. Because EM is a highly sensitive metric, it is common to use it together with F1 scores rather than in isolation.
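As a sketch with the evaluate library, here are three hypothetical answers that mirror the example on the slide; only the second one matches its reference exactly.

```python
import evaluate

exact_match = evaluate.load("exact_match")

# Hypothetical extractive QA outputs; only the second matches its reference exactly.
predictions = ["The theaters are great", "Theaters are great", "Cinemas are great"]
references = ["Theaters are great", "Theaters are great", "Theaters are great"]

results = exact_match.compute(predictions=predictions, references=references)
print(results["exact_match"])  # ~0.33, since one of the three answers is an exact match
```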

12. Let's practice!

That was a lot of new metrics! Let's familiarize ourselves with them through some practice.