
Evaluate generated text using ROUGE

You are given 10 samples from a question-answering dataset (Softage-AI/sft-conversational_dataset).

You have used TinyLlama-1.1B to generate answers to these samples, and your task is to evaluate the quality of the generated results against the ground truth.

The answers generated by this model are provided in test_answers and the ground truth in reference_answers. Use the ROUGE evaluation metric to assess the quality of the model's generations.
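For context, here is a hedged sketch of how such answers could have been generated with the Hugging Face transformers text-generation pipeline. The exercise already provides test_answers, so this is illustrative only; the model checkpoint name and the sample question are assumptions, not part of the exercise.

# Illustrative only: the exercise supplies test_answers for you
from transformers import pipeline

# Assumed public TinyLlama chat checkpoint on the Hugging Face Hub
generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

questions = ["What is the capital of France?"]  # stand-in for the 10 dataset questions

test_answers = []
for question in questions:
    output = generator(question, max_new_tokens=64, do_sample=False)
    # The pipeline output includes the prompt; keep only the newly generated text
    test_answers.append(output[0]["generated_text"][len(question):].strip())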

This exercise is part of the course Fine-Tuning with Llama 3.

Exercise instructions

  • Import the Hugging Face evaluation library.
  • Instantiate the evaluator by loading the ROUGE metric.
  • Run the evaluator instance with the given reference_answers and test_answers to compute the ROUGE scores.
  • Store in final_score the score from the results that measures the overlap of word pairs (bigrams) between the reference and generated answers.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the evaluation library from Hugging Face
import ____ 

# Instantiate the evaluator by loading the ROUGE metric with the evaluation library
rouge_evaluator = ____.load(____) 

# Fill in the method that computes the scores, passing your reference answers and test answers
results = rouge_evaluator.____

# Extract the ROUGE-2 (word-pair overlap) score from the results dictionary
final_score = results[____]
print(final_score)
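To check your work, here is a minimal completed sketch. It assumes reference_answers and test_answers are the lists of answer strings provided by the exercise; the tiny lists below are stand-ins so the snippet runs on its own, and it stores the ROUGE-2 score, which measures word-pair (bigram) overlap.

# Minimal completed sketch; the two lists are stand-ins for the data the exercise provides
import evaluate

reference_answers = ["Paris is the capital of France."]   # stand-in ground truth
test_answers = ["The capital of France is Paris."]        # stand-in model output

# Load the ROUGE metric from the Hugging Face evaluate library
rouge_evaluator = evaluate.load("rouge")

# Compute ROUGE scores for the generated answers against the references
results = rouge_evaluator.compute(predictions=test_answers, references=reference_answers)

# "rouge2" measures the overlap of word pairs (bigrams) between reference and generated answers
final_score = results["rouge2"]
print(final_score)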