Evaluation metrics for text generation
1. Evaluation metrics for text generation
Great work! Let's evaluate our text generation models.

2. Evaluating text generation
Text generation aims to create human-like text, posing a unique evaluation challenge where traditional metrics like accuracy or F1 score can fall short. Instead, we evaluate the quality and relevance of the generated text using metrics like BLEU and ROUGE, which compare it to reference texts and assess quality in a way that aligns more closely with how humans perceive language.

3. BLEU (Bilingual Evaluation Understudy)
To do this, we employ BLEU (Bilingual Evaluation Understudy), which compares the generated text with a reference text by examining the occurrence of n-grams. But what's an n-gram? In a sentence like 'the cat is on the mat', the 1-grams or uni-grams are each individual word, and the 2-grams or bi-grams are 'the cat', 'cat is', and so on. The more the generated n-grams match the reference n-grams, the higher the BLEU score. A perfect match results in a score of 1.0, while zero would mean no match.
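To make the idea concrete, here is a minimal sketch (not part of the course code) that extracts uni-grams and bi-grams from the example sentence using plain Python:

# Minimal sketch: extract n-grams from a sentence with plain Python.
sentence = "the cat is on the mat"
tokens = sentence.split()

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 1))  # uni-grams: ('the',), ('cat',), ('is',), ...
print(ngrams(tokens, 2))  # bi-grams: ('the', 'cat'), ('cat', 'is'), ...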
4. Calculating BLEU score with PyTorch
To calculate the BLEU score with PyTorch, we import BLEUScore from torchmetrics.text. We initialize our predicted and target texts; in this case, we compare the generated text 'the cat is on the mat' with two reference texts. We then instantiate BLEUScore and call the instance, passing the generated text and the reference texts. The resulting score is approximately 0.76, representing the average precision of the n-grams in the generated text that also appear in the reference texts.
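A minimal sketch of this workflow is shown below. The transcript does not list the two reference texts, so the references here are placeholders and the resulting score will not necessarily be 0.76:

from torchmetrics.text import BLEUScore

# BLEUScore expects a list of predictions and, for each prediction,
# a list of reference strings.
generated_text = ["the cat is on the mat"]
real_text = [["there is a cat on the mat", "a cat is on the mat"]]  # placeholder references

bleu = BLEUScore()
score = bleu(generated_text, real_text)
print(score)  # tensor holding a BLEU score between 0.0 and 1.0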
5. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) assesses generated text against reference text in two ways. ROUGE-N examines overlapping n-grams, with N representing the n-gram order. ROUGE-L checks for the longest common subsequence (LCS), the longest shared word sequence between the generated and reference text. ROUGE reports three metrics: F-measure, precision, and recall. Precision measures the proportion of n-grams in the generated text that also appear in the reference text, recall measures the proportion of n-grams in the reference text that also appear in the generated text, and F-measure is the harmonic mean of the two. The prefixes 'rouge1', 'rouge2', and 'rougeL' specify the n-gram order or LCS.
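As a quick illustration of these definitions (not the torchmetrics implementation), ROUGE-1 precision, recall, and F-measure can be computed by hand from uni-gram overlap counts; the reference sentence below is a made-up example:

from collections import Counter

generated = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()  # made-up reference sentence

# Count the uni-grams that the generated and reference texts have in common.
overlap = sum((Counter(generated) & Counter(reference)).values())

precision = overlap / len(generated)  # matches relative to the generated text
recall = overlap / len(reference)     # matches relative to the reference text
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f_measure)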
6. Calculating ROUGE score with PyTorch
We import the ROUGEScore module from torchmetrics.text to calculate the ROUGE score. We define both generated and real text, where the real text represents the model's ideal output and the generated text is the actual output. We then initialize the ROUGEScore module and apply it to our texts to obtain the ROUGE score, displayed in the next slide.
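A minimal sketch of these steps is given below; the generated and real texts are placeholders, since the transcript does not show the exact sentences used:

from torchmetrics.text import ROUGEScore

generated_text = "the cat is on the mat"  # placeholder: the model's actual output
real_text = "there is a cat on the mat"   # placeholder: the model's ideal output

rouge = ROUGEScore()
rouge_score = rouge(generated_text, real_text)  # dict of rouge1/rouge2/rougeL/rougeLsum metrics
print(rouge_score)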
7. ROUGE score: output
In the ROUGE score output, we first see rouge1_fmeasure, precision, and recall. These are the F1 score, precision, and recall, respectively, based on single words or uni-grams in the text. Next, we have rouge2_fmeasure, precision, and recall, which consider two consecutive words or bi-grams in the text. Then we see rougeL_fmeasure, precision, and recall; the "L" stands for "longest", representing the longest common subsequence between the generated and real text. Lastly, rougeLsum_fmeasure, precision, and recall consider the longest matching sequences, accounting for all such sequences in the text and summing them up. Each of these provides a different perspective on the quality and similarity of the generated text. A score of 0.88 means that 88% of the generated text matches the real text.
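Continuing the sketch above, individual metrics can be read from the returned dictionary by name:

# Keys follow the naming described above: rouge1, rouge2, rougeL, and rougeLsum,
# each with _fmeasure, _precision, and _recall variants.
print(rouge_score["rouge1_fmeasure"])   # uni-gram F-measure
print(rouge_score["rouge2_precision"])  # bi-gram precision
print(rouge_score["rougeL_recall"])     # longest-common-subsequence recall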
8. Considerations and limitations
As with most metrics, there are considerations to keep in mind. ROUGE and BLEU center around word presence without delving into semantic understanding. They are sensitive to the length of the generated text, and the quality and choice of reference texts play a crucial role in the score outcomes.
9. Let's practice!
Let's practice!