Evaluate generated text using ROUGE
You are given 10 samples from a question-answering dataset (Softage-AI/sft-conversational_dataset). You have used TinyLlama-1.1B to generate answers to these samples, and your task is to evaluate the quality of the generated results against the ground truth. The answers generated by this model are provided in test_answers and the ground truth in reference_answers. Use the ROUGE evaluation metric to evaluate the quality of the model's generation.
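For reference, ROUGE measures n-gram overlap between generated text and a reference text. Below is a minimal sketch of computing ROUGE with the Hugging Face evaluate library on an invented toy pair; the example sentences are made up for illustration and are not part of the exercise data.

# Minimal sketch: scoring one invented prediction/reference pair with ROUGE
import evaluate

rouge = evaluate.load("rouge")  # typically requires the rouge_score package to be installed

toy_prediction = ["The model answered the question correctly."]   # made-up example
toy_reference = ["The model answered the question accurately."]   # made-up example

scores = rouge.compute(predictions=toy_prediction, references=toy_reference)
print(scores)  # a dictionary with keys such as 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'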
This exercise is part of the course Fine-Tuning with Llama 3.
Instructions
- Import the evaluation library and the ROUGE metric.
- Instantiate the evaluation class and load the ROUGE metric.
- Run the evaluator instance with the given reference_answers and test_answers to compute the ROUGE scores.
- Store in final_score the score from the results that measures the overlap of word pairs (bigrams) between the reference and generated answers.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Import the evaluation library from Hugging Face
import ____
# Load the ROUGE metric using the evaluation library you imported
rouge_evaluator = ____.load(____)
# Fill in the method, passing your reference answers and test answers
results = rouge_evaluator.____
# Extract the score that measures word-pair overlap from the results dictionary
final_score = results[____]
print(final_score)
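One possible completion of the scaffold is sketched below. It assumes test_answers and reference_answers are already defined as lists of strings in the exercise environment, and it reads "word pairs" as the bigram-based rouge2 key; that key choice is an assumption, not a confirmed answer.

# Possible completion (sketch): assumes test_answers and reference_answers
# are lists of strings provided by the exercise environment
import evaluate

# Load the ROUGE metric
rouge_evaluator = evaluate.load("rouge")

# Compare the generated answers against the reference answers
results = rouge_evaluator.compute(predictions=test_answers, references=reference_answers)

# 'rouge2' measures word-pair (bigram) overlap; choosing this key is an assumption
final_score = results["rouge2"]
print(final_score)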