
Evaluate generated text using ROUGE

You are given 10 samples from a question-answering dataset (Softage-AI/sft-conversational_dataset).

You have used TinyLlama-1.1B to generate answers to these samples, and your task is to evaluate the quality of the generated results against the ground truth.

The answers generated by this model are provided in test_answers and the ground truth in reference_answers. Use the ROUGE evaluation metric to assess the quality of the model's generations.
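For context, here is a hedged sketch of how such answers could have been generated with the Hugging Face transformers text-generation pipeline. The exercise already provides test_answers, so this is illustrative only; the model checkpoint name and the sample question are assumptions, not part of the exercise.

# Illustrative only: the exercise supplies test_answers for you
from transformers import pipeline

# Assumed public TinyLlama chat checkpoint on the Hugging Face Hub
generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

questions = ["What is the capital of France?"]  # stand-in for the 10 dataset questions

test_answers = []
for question in questions:
    output = generator(question, max_new_tokens=64, do_sample=False)
    # The pipeline output includes the prompt; keep only the newly generated text
    test_answers.append(output[0]["generated_text"][len(question):].strip())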

This exercise is part of the course Fine-Tuning with Llama 3.

Exercise instructions

  • Import the Hugging Face evaluation library.
  • Instantiate the evaluator by loading the ROUGE metric.
  • Run the evaluator instance with the given reference_answers and test_answers to compute the ROUGE scores.
  • Store in final_score the score from the results that measures the overlap of word pairs (bigrams) between the reference and generated answers.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the evaluation library from Hugging Face
import ____ 

# Instantiate the evaluator by loading the ROUGE metric with the evaluation library
rouge_evaluator = ____.load(____) 

# Fill in the method that computes the scores, passing your reference answers and test answers
results = rouge_evaluator.____

# Extract the ROUGE-2 (word-pair overlap) score from the results dictionary
final_score = results[____]
print(final_score)
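To check your work, here is a minimal completed sketch. It assumes reference_answers and test_answers are the lists of answer strings provided by the exercise; the tiny lists below are stand-ins so the snippet runs on its own, and it stores the ROUGE-2 score, which measures word-pair (bigram) overlap.

# Minimal completed sketch; the two lists are stand-ins for the data the exercise provides
import evaluate

reference_answers = ["Paris is the capital of France."]   # stand-in ground truth
test_answers = ["The capital of France is Paris."]        # stand-in model output

# Load the ROUGE metric from the Hugging Face evaluate library
rouge_evaluator = evaluate.load("rouge")

# Compute ROUGE scores for the generated answers against the references
results = rouge_evaluator.compute(predictions=test_answers, references=reference_answers)

# "rouge2" measures the overlap of word pairs (bigrams) between reference and generated answers
final_score = results["rouge2"]
print(final_score)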