PPO fine-tuning
Having initialized the trainer, you now need to set up the training loop to fine-tune the model. The PPO trainer, ppo_trainer, has been initialized using the PPOTrainer class from the trl Python library.
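For context, a trainer like this could be initialized roughly as follows. This is a minimal sketch, not the course's exact setup: it assumes the classic trl API (PPOConfig, AutoModelForCausalLMWithValueHead, PPOTrainer) and a tokenized query dataset named dataset prepared in an earlier step; the model name and hyperparameter values are illustrative.

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# PPO hyperparameters (illustrative values)
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=16)

# Policy model with a value head, plus a frozen reference copy used for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# dataset is assumed: a tokenized query dataset built in an earlier step
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset)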
Exercise instructions
- Generate response tensors from the input IDs using the trainer within the PPO loop.
- Complete the PPO step that uses the queries, responses, and rewards to optimize the PPO model.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
for batch in tqdm(ppo_trainer.dataloader):
# Generate responses for the given queries using the trainer
response_tensors = ____(batch["input_ids"])
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
rewards = reward_model(texts)
    # Run a PPO training step with the query IDs, response tensors, and rewards
stats = ____(batch["input_ids"], response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
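For reference, a completed version of the loop might look like the sketch below. It assumes the classic trl PPOTrainer API, in which generate produces response tensors and step runs one PPO optimization pass, and it assumes reward_model returns one scalar reward tensor per text, which is what step expects.

from tqdm import tqdm

for batch in tqdm(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]

    # Generate responses for the given queries using the trainer
    response_tensors = ppo_trainer.generate(query_tensors)
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Score each query-response pair with the reward model
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(texts)  # assumed to return a list of scalar reward tensors

    # Run one PPO optimization step and log the statistics
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)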