Preparare l'insieme di dati delle preferenze

In questo esercizio lavorerai con un insieme di dati che contiene feedback umano sotto forma di output "chosen" e "rejected". Il tuo compito è estrarre i prompt dalla colonna "chosen" e preparare i dati per addestrare un modello di ricompensa.

La funzione load_dataset da datasets è già stata importata

Questo esercizio fa parte del corso

Reinforcement Learning from Human Feedback (RLHF)

Visualizza il corso

Istruzioni dell'esercizio

Carica l'insieme di dati trl-internal-testing/hh-rlhf-helpful-base-trl-style da Hugging Face.
Scrivi una funzione che estragga il prompt dal campo 'content', assumendo che il prompt si trovi all'indice 0 dell'input della funzione.
Applica la funzione che estrae il prompt al sottoinsieme di dati 'chosen'.

Esercizio pratico interattivo

Prova a risolvere questo esercizio completando il codice di esempio.

# Load the dataset
preference_data = ____

# Define a function to extract the prompt
def extract_prompt(text):
    ____
    return prompt

# Apply the function to the dataset 
preference_data_with_prompt = ____(
    lambda sample: {**sample, 'prompt': ____(sample['chosen'])}
)

sample = preference_data_with_prompt.select(range(1))
print(sample['prompt'])

Modifica ed esegui il codice

Questo esercizio fa parte del corso

Reinforcement Learning from Human Feedback (RLHF)

AvançadoNível de habilidade

4.8+

Inizia il corso gratis

This chapter introduces the basics of Reinforcement Learning with Human Feedback (RLHF), a technique that uses human input to help AI models learn more effectively. Get started with RLHF by understanding how it differs from traditional reinforcement learning and why human feedback can enhance AI performance in various domains.

Exercise 1: Introduzione a RLHF Exercise 2: Generazione di testo con RLHF Exercise 3: Classificare il testo generato per RLHF Exercise 4: RL vs. RLHF Exercise 5: Esplorare gli LLM pre-addestrati Exercise 6: Tokenizza un insieme di dati testuale Exercise 7: Fine-tuning per la classificazione delle recensioni Exercise 8: Preparare i dati per RLHF Exercise 9: Preparare l'insieme di dati delle preferenze

Esercizio in corso

Exercise 10: Estrazione dei prompt

Discover how to set up systems for gathering human feedback in this Chapter. Learn best practices for collecting high-quality data, from pairwise comparisons to uncertainty sampling, and explore strategies for enhancing your data collection.

Exercise 1: Methods for high-quality feedback gathering Exercise 2: Understanding comparison and rating in RLHF Exercise 3: Comparing slogans for a gym campaign Exercise 4: Measuring feedback quality and relevance Exercise 5: Low confidence Exercise 6: K-means for feedback clustering Exercise 7: Active learning Exercise 8: Implementing an active learning pipeline Exercise 9: Active learning loop

In this Chapter, you'll get into the core of Reinforcement Learning from Human Feedback training. This includes exploring fine-tuning with PPO, techniques to train efficiently, and handling potential divergences from your metrics' objectives.

Exercise 1: Reward models explored Exercise 2: Initializing the reward Exercise 3: Setting up the reward trainer Exercise 4: Training with PPO Exercise 5: Initialize the PPO trainer Exercise 6: PPO fine-tuning Exercise 7: Efficient fine-tuning in RLHF Exercise 8: Prepare for 8-bit Training Exercise 9: Train with LoRA

Explore key techniques for assessing and improving model performance in this last Chapter of Reinforcement Learning from Human Feedback (RLHF): from fine-tuning metrics to incorporating diverse feedback sources, you'll be provided with a comprehensive toolkit to refine your models effectively.

Exercise 1: Model metrics and adjustments Exercise 2: Mitigating negative KL divergence Exercise 3: Checking the reward model Exercise 4: Incorporating diverse feedback sources Exercise 5: Majority voting on multiple data sources Exercise 6: Unreliable data source identification Exercise 7: Evaluating RLHF models Exercise 8: Interpreting curves Exercise 9: Evaluating RLHF with metrics Exercise 10: Wrapping up your RLHF journey