Evaluate generated text using ROUGE
You are given 10 samples from a question-answering dataset (Softage-AI/sft-conversational_dataset). You have used TinyLlama-1.1B to generate answers to these samples, and your task is to evaluate the quality of the generated answers against the ground truth. The answers generated by this model are provided in test_answers, and the ground truth in reference_answers. Use the ROUGE evaluation metric to assess the quality of the model's generations.
This exercise is part of the course Fine-Tuning with Llama 3.
Exercise instructions
- Import the evaluation library and the ROUGE metric.
- Instantiate the evaluation class and load the ROUGE metric.
- Run the evaluator instance with the given reference_answers and test_answers to compute the ROUGE scores.
- Store in final_score the score from the results that checks the overlap of word pairs between the reference and generated answers.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the evaluation library from Hugging Face
import ____
# Use your evaluation library to load the ROUGE metric
rouge_evaluator = ____.load(____)
# Fill in the method, passing your reference answers and test answers
results = rouge_evaluator.____
# Extract the ROUGE1 score from the results dictionary
final_score = results[____]
print(final_score)
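If you get stuck, here is one possible completion. This is a sketch, not the official course solution: it assumes the Hugging Face evaluate library (with the rouge_score package installed as its backend) and that test_answers and reference_answers are already defined in the exercise environment as lists of strings.

# Sketch of one possible completion (assumes test_answers and reference_answers
# are lists of strings provided by the exercise environment)
import evaluate

# Load the ROUGE metric from the evaluate library
rouge_evaluator = evaluate.load("rouge")

# Compute ROUGE scores between the generated answers and the references
results = rouge_evaluator.compute(predictions=test_answers, references=reference_answers)

# results is a dictionary with rouge1, rouge2, rougeL, and rougeLsum scores;
# rouge1 measures single-word overlap, while rouge2 measures word-pair (bigram) overlap
final_score = results["rouge1"]
print(final_score)

Running this prints a single float between 0 and 1; higher values indicate greater word overlap between the generated and reference answers.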