Initialize the PPO trainer
You are working for a customer service company that uses a chatbot to handle customer inquiries. The chatbot provides helpful responses, but you recently received feedback that those responses lack depth. You need to fine-tune the model behind the chatbot, and you start by creating a PPO trainer instance.
The dataset_cs has already been loaded.
This exercise is part of the course Reinforcement Learning from Human Feedback (RLHF).
Exercise instructions
- Initialize the PPO configuration with the model name "gpt2" and a learning rate of 1.2e-5.
- Load AutoModelForCausalLMWithValueHead, the causal language model with a value head.
- Create the PPOTrainer() using the model, configuration, and tokenizer just defined, and with the preloaded dataset.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from trl import PPOConfig, AutoModelForCausalLMWithValueHead, PPOTrainer
from transformers import AutoTokenizer
# Initialize PPO Configuration
gpt2_config = ____(model_name=____, learning_rate=____)
# Load the model
gpt2_model = ____(gpt2_config.model_name)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_config.model_name)
# Initialize PPO Trainer
ppo_trainer = ____
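One possible completion is sketched below. It assumes the older TRL API (pre-0.8 style) in which PPOConfig accepts model_name and learning_rate, and PPOTrainer accepts a tokenizer and dataset directly; dataset_cs is the preloaded customer-service dataset mentioned above.
from trl import PPOConfig, AutoModelForCausalLMWithValueHead, PPOTrainer
from transformers import AutoTokenizer
# Initialize PPO Configuration with the base model name and learning rate
gpt2_config = PPOConfig(model_name="gpt2", learning_rate=1.2e-5)
# Load the causal language model with an added value head, plus its tokenizer
gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(gpt2_config.model_name)
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_config.model_name)
# Initialize PPO Trainer with the config, model, tokenizer, and preloaded dataset
ppo_trainer = PPOTrainer(
    config=gpt2_config,
    model=gpt2_model,
    tokenizer=gpt2_tokenizer,
    dataset=dataset_cs,  # assumed to be the preloaded dataset
)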