
Initialize the PPO trainer

You are working for a customer service company that uses a chatbot to handle customer inquiries. The chatbot provides helpful responses, but you recently received feedback that they lack depth. You need to fine-tune the model behind the chatbot, and you start by creating a PPO trainer instance.

The dataset_cs has already been loaded.

This exercise is part of the course

Reinforcement Learning from Human Feedback (RLHF)


Exercise instructions

  • Initialize the PPO configuration with the model name "gpt2" and a learning rate of 1.2e-5.
  • Load AutoModelForCausalLMWithValueHead, the causal language model with a value head.
  • Create the PPOTrainer() using the model, configuration, and tokenizer just defined, and with the preloaded dataset.

Hands-on interactive exercise

Try this exercise by completing the sample code.

from trl import PPOConfig, AutoModelForCausalLMWithValueHead, PPOTrainer
from transformers import AutoTokenizer

# Initialize PPO Configuration
gpt2_config = ____(model_name=____, learning_rate=____)

# Load the model
gpt2_model = ____(gpt2_config.model_name)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_config.model_name)

# Initialize PPO Trainer
ppo_trainer = ____
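
If you get stuck, one possible completion is sketched below. It assumes the legacy TRL PPO API (versions prior to 0.12), in which PPOConfig accepts a model_name argument and PPOTrainer is constructed directly from the config, model, tokenizer, and dataset; newer TRL releases use a different PPOTrainer signature, so treat this as a sketch rather than the canonical solution.

from trl import PPOConfig, AutoModelForCausalLMWithValueHead, PPOTrainer
from transformers import AutoTokenizer

# Initialize the PPO configuration with the model name and learning rate
gpt2_config = PPOConfig(model_name="gpt2", learning_rate=1.2e-5)

# Load the causal language model with a value head, plus its tokenizer
gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(gpt2_config.model_name)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_config.model_name)

# Create the PPO trainer from the config, model, tokenizer,
# and the preloaded customer-service dataset (dataset_cs)
ppo_trainer = PPOTrainer(
    config=gpt2_config,
    model=gpt2_model,
    tokenizer=gpt2_tokenizer,
    dataset=dataset_cs,
)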