
PPO fine-tuning

Having initialized the trainer, you now need to set up the training loop that fine-tunes the model.

The PPO trainer ppo_trainer has been initialized using the PPOTrainer class from the trl Python library.
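For reference, a minimal sketch of what that initialization might look like, assuming the classic (pre-0.12) trl API; the model name, hyperparameters, and dataset below are illustrative placeholders, not the course's actual setup:

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Illustrative configuration; the course's model and hyperparameters may differ
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=16, mini_batch_size=16)

# Policy model with a value head, plus a frozen reference copy for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# dataset is assumed to be defined elsewhere and to yield dicts with "input_ids" and "query"
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer, dataset=dataset)

When a dataset is passed in, the trainer builds its own dataloader from it, which is what ppo_trainer.dataloader iterates over in the loop below.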

This exercise is part of the course Reinforcement Learning from Human Feedback (RLHF).

Exercise instructions

  • Generate response tensors from the input ids using the trainer within the PPO loop.
  • Complete the step within the PPO loop that uses the queries, responses, and reward data to optimize the PPO model.

Hands-on interactive exercise

Complete this exercise by finishing the sample code below.

for batch in tqdm(ppo_trainer.dataloader): 

    # Generate responses for the given queries using the trainer
    response_tensors = ____(batch["input_ids"])

    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    texts = [q + r for q, r in zip(batch["query"], batch["response"])]

    rewards = reward_model(texts)

    # Run the PPO training step with the query ids, response ids, and rewards
    stats = ____(batch["input_ids"], response_tensors, rewards)

    ppo_trainer.log_stats(stats, batch, rewards)
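For reference, one way to complete the loop, assuming the classic trl PPOTrainer API in which the trainer exposes generate() and step() methods, and assuming reward_model already returns per-example reward scores in the form step() expects:

from tqdm import tqdm

for batch in tqdm(ppo_trainer.dataloader):

    # Generate responses for the given queries using the trainer
    response_tensors = ppo_trainer.generate(batch["input_ids"])

    # Decode the generated token ids back into text
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Concatenate each query with its response and score the pairs
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(texts)

    # Run the PPO training step with the query ids, response ids, and rewards
    stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)

    # Record the training statistics for this batch
    ppo_trainer.log_stats(stats, batch, rewards)

Here step() performs the PPO optimization update and returns a dictionary of training statistics (such as losses and the KL divergence from the reference model), which log_stats then records together with the batch and rewards.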