PPO fine-tuning
Now that the trainer is initialized, you need to write the loop that fine-tunes the model.
The PPO trainer ppo_trainer has been initialized using the PPOTrainer class from the trl Python library.
This exercise is part of the course
Reinforcement Learning from Human Feedback (RLHF)
Exercise instructions
- Within the PPO loop, generate response tensors from the input ids using the trainer.
- Complete the step within the PPO loop that uses the query, response, and reward data to optimize the PPO model.
Hands-on interactive exercise
Try this exercise by completing this sample code.
for batch in tqdm(ppo_trainer.dataloader):
    # Generate responses for the given queries using the trainer
    response_tensors = ____(batch["input_ids"])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(texts)
    # Run a PPO training step with the query ids, response ids, and rewards
    stats = ____(batch["input_ids"], response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
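
The data flow of the loop above can be sketched without trl, torch, or a real model. In the older trl releases that expose this loop-style API, the two blanks correspond to the trainer's generate and step methods. Everything else in the sketch is an assumption made for illustration: FakePPOTrainer, FakeTokenizer, fake_reward_model, and run_loop are hypothetical stand-ins, token ids are plain Python lists rather than tensors, and the "update" only checks alignment and averages the rewards.

```python
# Stand-ins so the loop shape can be run without trl, torch, or a model.
# Every Fake*/fake_* name is hypothetical and exists only for this sketch.

class FakeTokenizer:
    def decode(self, ids):
        # Turn a list of ids into a readable string such as "tok101 tok102".
        return " ".join(f"tok{i}" for i in ids)


class FakePPOTrainer:
    """Mimics only the members the loop touches: dataloader, generate, step, log_stats."""

    def __init__(self, batches):
        self.dataloader = batches
        self.logged = []

    def generate(self, query_ids):
        # A real trainer samples from the policy model; this stand-in
        # just shifts every query id by 100 to fabricate a response.
        return [[i + 100 for i in q] for q in query_ids]

    def step(self, queries, responses, rewards):
        # A real step runs a PPO update; this stand-in only checks that the
        # three lists line up one-to-one and returns a summary statistic.
        assert len(queries) == len(responses) == len(rewards)
        return {"mean_reward": sum(rewards) / len(rewards)}

    def log_stats(self, stats, batch, rewards):
        self.logged.append(stats)


def fake_reward_model(texts):
    # Hypothetical scoring rule: one point per token marker in the text.
    return [float(t.count("tok")) for t in texts]


def run_loop(ppo_trainer, tokenizer, reward_model):
    for batch in ppo_trainer.dataloader:
        # 1. Generate one response per query.
        response_tensors = ppo_trainer.generate(batch["input_ids"])
        batch["response"] = [tokenizer.decode(r) for r in response_tensors]
        # 2. Score each query + response pair.
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        rewards = reward_model(texts)
        # 3. Optimize on (query, response, reward) triples and log the result.
        stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
    return ppo_trainer.logged


batches = [
    {"query": ["tok1 tok2", "tok3"], "input_ids": [[1, 2], [3]]},
    {"query": ["tok4"], "input_ids": [[4]]},
]
logged = run_loop(FakePPOTrainer(batches), FakeTokenizer(), fake_reward_model)
print(logged)
```

The point of the sketch is the ordering and the one-to-one alignment: each query yields exactly one response and one reward, and the step call receives all three lists together. With the real trainer, the rewards are typically expected as one scalar tensor per sample rather than plain floats.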