Exercise

PPO fine-tuning

With the trainer initialized, you now need to write the training loop that fine-tunes the model.

The trainer ppo_trainer has already been initialized using the PPOTrainer class from the trl Python library.

Instructions

  • Within the PPO loop, use the trainer to generate response tensors from the input IDs.
  • Complete the step within the PPO loop that uses the queries, responses, and rewards to optimize the PPO model.
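
The two steps above can be sketched as a small loop. This is a minimal illustration, not the exercise solution: it assumes the trainer exposes a generate method and a step method with the call shapes the instructions describe, and it swaps in a hypothetical StubPPOTrainer, a toy reward function, and plain Python lists in place of tensors so that it runs without trl, a model, or a GPU. The names StubPPOTrainer, toy_reward, and run_ppo_loop are made up for this sketch.

```python
# Hypothetical stand-in for the pre-initialized ppo_trainer.
# Only the call shapes of generate() and step() follow the exercise text;
# the bodies are placeholders and do not perform PPO.
class StubPPOTrainer:
    def generate(self, query_tensors, **generation_kwargs):
        # Placeholder generation: each "response" is the query reversed.
        return [list(reversed(q)) for q in query_tensors]

    def step(self, queries, responses, rewards):
        # A real trainer would run one PPO optimization step here.
        assert len(queries) == len(responses) == len(rewards)
        return {"mean_reward": sum(rewards) / len(rewards)}


def toy_reward(response):
    # Hypothetical reward: fraction of even token ids in the response.
    return sum(1 for t in response if t % 2 == 0) / len(response)


def run_ppo_loop(ppo_trainer, batches, reward_fn):
    stats_per_batch = []
    for batch in batches:
        query_tensors = batch["input_ids"]
        # Step 1: generate responses from the input ids using the trainer.
        response_tensors = ppo_trainer.generate(query_tensors)
        # Score each response; in practice this comes from a reward model.
        rewards = [reward_fn(r) for r in response_tensors]
        # Step 2: optimize the policy with queries, responses, and rewards.
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        stats_per_batch.append(stats)
    return stats_per_batch


batches = [{"input_ids": [[1, 2, 3, 4], [2, 4, 6, 8]]}]
all_stats = run_ppo_loop(StubPPOTrainer(), batches, toy_reward)
```

With the real ppo_trainer, the loop body would keep the same order (generate, score, step), with tensors and a reward model in place of the lists and toy_reward used here.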