
Metrics for language tasks: perplexity and BLEU

1. Metrics for language tasks: perplexity and BLEU

Nice work! Classical metrics like accuracy and F1 are useful but limited for complex language tasks. This is where specialized metrics come in.

2. LLM tasks and metrics

Two popular metrics in text generation are perplexity and BLEU score.

3. Perplexity

Perplexity measures how accurately and confidently the model predicts the next word in a sentence or sequence. In general, a lower perplexity score indicates higher confidence in the predictions. Here is a stripped-back example of a model generating text based on input text. The input text is the start of a sentence about research in Antarctica, and the generated text completes the sentence. To obtain the generated text, we need to turn the input text into token ids, then turn the output ids back into text. This is achieved with .encode() to apply the tokenizer, .generate() to obtain the token ids of the generated text, and .decode() to convert these ids into human-readable text. During generation, the model, typically a GPT-style causal language model, assigned a probability to each word it produced in the sequence. Perplexity uses these probabilities to calculate the model's confidence in its predictions.
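As a minimal sketch of these three steps, assuming a small GPT-2 model loaded with Hugging Face transformers (the model name and input sentence here are illustrative, not necessarily the ones used in the lesson):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative model choice; any causal language model works the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_text = "Current research in Antarctica focuses on"  # illustrative input

# .encode() applies the tokenizer, turning the input text into token ids
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# .generate() returns the token ids of the input plus the generated continuation
output_ids = model.generate(input_ids, max_new_tokens=20)

# .decode() converts these ids back into human-readable text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```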

4. Perplexity output

We load perplexity with the evaluate library, specifying the module type as metric. Next, we use the .compute() method, specifying the predictions, which are the generated texts, and the model used to compute the metric. The output is a dictionary containing a list of perplexity scores, one for each input text (abbreviated here). When multiple generated text predictions are passed, it is common to assess their average perplexity. We look at the mean_perplexity, which, helpfully, is one of the dictionary keys. The result depends heavily on the text the model was trained on. Note that for the metric result to be interpretable, we should compare it to baseline results.
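A minimal sketch of this step with the evaluate library, assuming GPT-2 as the scoring model and illustrative prediction strings:

```python
import evaluate

# Load perplexity, specifying the module type as a metric
perplexity = evaluate.load("perplexity", module_type="metric")

# The generated texts to score (illustrative strings)
predictions = [
    "Current research in Antarctica focuses on climate change.",
    "Current research in Antarctica focuses on penguin populations.",
]

# model_id names the model used to compute the metric (assumed here to be GPT-2)
results = perplexity.compute(predictions=predictions, model_id="gpt2")

print(results["perplexities"])     # one perplexity score per input text
print(results["mean_perplexity"])  # the average across all predictions
```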

5. BLEU

BLEU can also be used to evaluate text generation, as well as summarization and translation. It measures the quality of an LLM's outputs against references provided by humans. To use the metric, these LLM predictions and human references need to be stored in variables. Let's load BLEU and try it out using the same input text as before, but now with human references.
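For example, assuming the same evaluate library, loading BLEU and storing illustrative predictions and references could look like this:

```python
import evaluate

# Load the BLEU metric
bleu = evaluate.load("bleu")

# LLM prediction and human references stored in variables
# (the sentences themselves are illustrative placeholders)
generated_text = "Current research in Antarctica focuses on climate change."
references = [
    [
        "Current research in Antarctica focuses on climate change.",
        "Research in Antarctica currently focuses on climate change.",
    ]
]
```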

6. BLEU output

Notice how the generated text is wrapped in a list before being passed as the predictions argument of the metric's compute method. BLEU's output is a score between 0 and 1, indicating how similar the prediction is to the references. A value closer to 1 means higher similarity. In our case, we had a perfect match with a reference, so the score is 1.
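Putting it together, a self-contained sketch of the compute call (again with illustrative sentences rather than the lesson's exact ones):

```python
import evaluate

bleu = evaluate.load("bleu")

generated_text = "Current research in Antarctica focuses on climate change."
references = [["Current research in Antarctica focuses on climate change."]]

# The single generated string is wrapped in a list before being passed
# as the predictions argument
results = bleu.compute(predictions=[generated_text], references=references)

# The "bleu" key holds the score between 0 and 1; here it is 1.0 because
# the prediction exactly matches a reference
print(results["bleu"])
```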

7. Let's practice!

Time to practice.