
Evaluate generated text using ROUGE

You are given 10 samples from a question-answering dataset (Softage-AI/sft-conversational_dataset).

You have used TinyLlama-1.1B to generate answers to these samples, and your task is to evaluate the quality of the generated results against the ground truth.

The answers generated by this model are provided in test_answers and the ground truth in reference_answers. Use the ROUGE evaluation metrics to evaluate the quality of the model's generation.
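
ROUGE scores n-gram overlap between a generated text and a reference: rouge1 counts single-word overlap, rouge2 counts word-pair (bigram) overlap, and rougeL is based on the longest common subsequence. Below is a minimal sketch of what a ROUGE computation returns, using the Hugging Face evaluate library with made-up toy strings that are not part of the exercise data.

# Minimal illustration of a ROUGE computation (toy data, not the exercise variables)
import evaluate

# Load the ROUGE metric (requires the rouge_score package to be installed)
rouge = evaluate.load("rouge")

# Toy prediction/reference pair, purely for illustration
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat lay on the mat"],
)

# results is a dictionary with rouge1, rouge2, rougeL and rougeLsum scores
print(results)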

This exercise is part of the course

Fine-Tuning with Llama 3


Instructions

  • Import the Hugging Face evaluation library.
  • Use the library to load the ROUGE metric and create the evaluator.
  • Run the evaluator with the given reference_answers and test_answers to compute the ROUGE scores.
  • Store in final_score the score from the results that measures the overlap of word pairs (bigrams) between the reference and generated answers.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Import the evaluation library from Hugging Face
import ____ 

# Use the evaluation library to load the ROUGE metric
rouge_evaluator = ____.load(____) 

# Fill in the method, and place your reference answers and test answers
results = rouge_evaluator.____

# Extract the score that measures word-pair overlap (ROUGE-2) from the results dictionary
final_score = results[____]
print(final_score)
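
A possible completion of the exercise is sketched below. It assumes reference_answers and test_answers are the lists of strings provided by the exercise environment; this is an illustrative sketch, not the official solution.

# Import the evaluation library from Hugging Face
import evaluate

# Load the ROUGE metric to create the evaluator
rouge_evaluator = evaluate.load("rouge")

# Compute ROUGE scores for the generated answers against the references
results = rouge_evaluator.compute(
    predictions=test_answers,
    references=reference_answers,
)

# rouge2 measures word-pair (bigram) overlap between reference and generated answers
final_score = results["rouge2"]
print(final_score)

compute() returns a dictionary of scores; its rouge2 entry is the bigram-overlap score, which is why it is the value requested by the instructions.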