
PPO fine-tuning

With the trainer initialized, the next step is to write the training loop that fine-tunes the model.

The PPO trainer, ppo_trainer, has been initialized using the PPOTrainer class from the trl Python library.

This exercise is part of the course

Reinforcement Learning from Human Feedback (RLHF)


Exercise instructions

  • Inside the PPO loop, use the trainer to generate response tensors from the input IDs.
  • Complete the step inside the PPO loop that uses the query, response, and reward data to optimize the PPO model.

Hands-on interactive exercise

Try this exercise by completing the sample code.

for batch in tqdm(ppo_trainer.dataloader): 

    # Generate responses for the given queries using the trainer
    response_tensors = ____(batch["input_ids"])

    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    texts = [q + r for q, r in zip(batch["query"], batch["response"])]

    rewards = reward_model(texts)

    # Run a PPO training step with the query ids, response ids, and rewards
    stats = ____(batch["input_ids"], response_tensors, rewards)

    ppo_trainer.log_stats(stats, batch, rewards)
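
The data flow of the loop above (generate a response per query, score each response, then run one optimization step per batch) can be sketched without trl or a real model. This is only a toy illustration: ToyTrainer, toy_reward, and run_loop are hypothetical stand-ins invented here, their method names are chosen for readability, and none of this is the trl implementation or the exercise solution.

```python
class ToyTrainer:
    """Hypothetical stand-in for a PPO trainer; NOT the trl PPOTrainer."""

    def __init__(self, batches):
        self.dataloader = batches  # iterable of batches, as in the loop above
        self.steps = 0

    def generate(self, input_ids):
        # Pretend generation: return one "response" (list of token ids) per query.
        return [[tok + 1 for tok in ids] for ids in input_ids]

    def step(self, queries, responses, rewards):
        # An optimization step needs exactly one response and one reward per query.
        assert len(queries) == len(responses) == len(rewards)
        self.steps += 1
        return {"mean_reward": sum(rewards) / len(rewards)}


def toy_reward(responses):
    # Hypothetical reward function: longer responses score higher.
    return [float(len(r)) for r in responses]


def run_loop(trainer):
    history = []
    for batch in trainer.dataloader:
        responses = trainer.generate(batch["input_ids"])        # 1. generate
        rewards = toy_reward(responses)                         # 2. score
        stats = trainer.step(batch["input_ids"], responses, rewards)  # 3. optimize
        history.append(stats)
    return history


batches = [{"input_ids": [[1, 2, 3], [4, 5]]}, {"input_ids": [[6]]}]
trainer = ToyTrainer(batches)
history = run_loop(trainer)
print(history)
```

The point of the sketch is the alignment constraint: queries, responses, and rewards are parallel lists of the same length, so each reward is credited to the response that earned it.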