Initializing the reward model
You are in the final stages of deploying a generative model designed to offer personalized recommendations for an online bookstore. To align this model with human-preferred recommendations, you need to train a reward model using some collected preference data. The first step is to initialize the model and configuration parameters.
The AutoTokenizer and AutoModelForSequenceClassification classes were preloaded from transformers. RewardConfig was preloaded from trl.
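If you want to reproduce this setup outside the exercise environment, the preloaded imports would look roughly like the following sketch (assuming the transformers and trl packages are installed):
# Imports assumed to be preloaded in the exercise environment
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardConfig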
This exercise is part of the course
Reinforcement Learning from Human Feedback (RLHF)
Exercise instructions
- Load the GPT-1 model, "openai-gpt", for the sequence classification task using Hugging Face's AutoModelForSequenceClassification.
- Initialize the reward configuration using "output_dir" as the output directory, and set the token maximum length to 60.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the pre-trained GPT-1 model for text classification
model = ____
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
# Initialize the reward configuration and set max_length
config = ____(output_dir=____, max_length=____)
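One possible way to fill in the blanks, based on the instructions above (a sketch, not the only valid solution):
# Load the pre-trained GPT-1 model for text classification
model = AutoModelForSequenceClassification.from_pretrained("openai-gpt")
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")

# Initialize the reward configuration and set max_length
config = RewardConfig(output_dir="output_dir", max_length=60)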