1. Reward models explored
Welcome back! We'll now explore the concept of reward models, and how to train them.
2. Process so far
So far, we've focused on how to set up our RLHF process to use good-quality, optimized human feedback, and have seen how to build a preference dataset that reflects human choices.
3. Process so far
We can now use this preference dataset to train a reward model. Let's take a look at how this works!
4. What is a reward model?
A reward model is designed to learn the quality of an output based on guiding principles that determine whether the output should be 'rewarded' or not, a set of principles we'll call the 'reward scheme'.
5. What is a reward model?
This model is informed by an agent that observes its environment and takes actions guided by the scheme; the agent learns to maximize rewards by recognizing optimal actions.
The reward model uses human feedback to evaluate these actions, learning to recognize the patterns that humans consider helpful and informative.
6. Using the reward trainer
To train a reward model using preference data, we use the RewardTrainer class from TRL, or Transformer Reinforcement Learning. TRL is a Hugging Face library that provides tools and functionality for training and fine-tuning transformer language models using reinforcement learning techniques, including reward modeling.
We'll load a model and tokenizer using the transformers library, along with a preference dataset. The dataset should contain pairs of examples, where each pair consists of a "chosen" and a "rejected" sequence, as in the sketch below.
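Here is a minimal sketch of this setup step, assuming a small base model such as distilroberta-base and the Anthropic/hh-rlhf preference dataset; both names are placeholders you can swap for your own model and data.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Placeholder checkpoint; a reward model outputs a single scalar score,
# so we load a sequence-classification head with num_labels=1
model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example preference dataset with "chosen" and "rejected" text columns
dataset = load_dataset("Anthropic/hh-rlhf")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

Depending on your TRL version, the trainer may expect these pairs pre-tokenized into chosen and rejected input IDs, or it may tokenize the text columns for you, so check the version you have installed.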
7. Training the reward model
Next, we define the training arguments using RewardConfig from the trl library.
The output directory parameter sets the directory where the trained model and outputs will be saved.
The batch sizes determine the number of training and evaluation samples processed together in each batch.
The number of training epochs specifies how many times the entire training dataset will be passed through the model.
Lastly, the learning rate controls the rate at which the model's weights are updated based on the loss gradient. We can find its optimal value using hyperparameter tuning, or use a standard starting value such as 1e-3.
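Putting these arguments together, the configuration might look like the following sketch; the output directory, batch sizes, and epoch count shown here are illustrative values, not prescriptions.

from trl import RewardConfig

training_args = RewardConfig(
    output_dir="./reward_model",      # where the trained model and outputs are saved
    per_device_train_batch_size=8,    # training samples processed together per batch
    per_device_eval_batch_size=8,     # evaluation samples processed together per batch
    num_train_epochs=1,               # full passes through the training dataset
    learning_rate=1e-3,               # starting value; tune for your setup
)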
8. Training the reward model
We then initialize the RewardTrainer.
First, we pass in the model, which is the pre-trained model that will be fine-tuned.
Next, we include args, which contains the training arguments that we defined in the previous step.
We also specify the 'train_dataset', which is the dataset used for training the model, and the 'eval_dataset', used for evaluating the model during training.
Finally, we provide the tokenizer to preprocess the input text data.
After initializing the RewardTrainer with these arguments, we call dot train to start training the reward model. This process will fine-tune the model using the specified training arguments, datasets, and tokenizer.
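Continuing from the objects defined above, a sketch of this final step could look as follows; note that recent TRL releases rename the tokenizer argument to processing_class, so adjust to match your installed version.

from trl import RewardTrainer

trainer = RewardTrainer(
    model=model,                  # pre-trained model to fine-tune
    args=training_args,           # training arguments defined with RewardConfig
    train_dataset=train_dataset,  # preference pairs used for training
    eval_dataset=eval_dataset,    # preference pairs used for evaluation during training
    tokenizer=tokenizer,          # tokenizer used to preprocess the input text
)

trainer.train()                   # start fine-tuning the reward model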
9. Let's practice!
Let's practice training our reward model!