
Model metrics and adjustments

1. Model metrics and adjustments

Hello! Let's explore how to use and understand reward metrics and adjust them to meet our goals.

2. Why use a reference model?

So far, we've seen that we can train a model using RL based on the rewards or feedback it gets. But what happens if the model learns to cheat the system and starts producing strange or meaningless outputs? For example, the model might figure out that using lots of emojis results in more positive feedback from humans, without improving the actual quality of the text. This would trick the reward system, and the model would keep using emojis even when they add nothing meaningful.

3. Checking model output

Let's go back to the whole RLHF process. Remember that we mentioned we'd add a check of the response, so that the model wouldn't deviate too much from an initial version.

4. Solution: KL divergence

This is done through a method called the KL divergence penalty, where KL stands for Kullback–Leibler.

5. Solution: KL divergence

This penalty is added to the reward to prevent the model from going too far off track. If the model starts producing unrelated outputs, the penalty pulls it back toward the reference. KL divergence measures the difference between two probability distributions. In RLHF training, it helps us compare the current model with the reference model. We aim to keep the KL divergence between 0 and 10 to ensure that the new model's output stays close to what the reference model would generate. It should never be negative: a negative value indicates that the model is moving away from learning the right reward function.
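To make this concrete, here is a minimal sketch of how a per-token KL estimate might be computed and folded into the reward using PyTorch. The tensors and the coefficient kl_coef are illustrative values, not taken from a specific library, and the log-ratio used here is the common approximation of KL divergence in PPO-style RLHF.

```python
import torch

# Log-probabilities that the current policy and the frozen reference model
# assign to the same generated tokens (illustrative values, one per token).
policy_logprobs = torch.tensor([-0.9, -0.7, -1.5, -0.4])
ref_logprobs = torch.tensor([-1.1, -0.9, -1.8, -0.6])

# Per-token KL estimate: log p_policy(token) - log p_ref(token).
# Summed over the sequence, this approximates KL(policy || reference).
kl_per_token = policy_logprobs - ref_logprobs
kl_estimate = kl_per_token.sum()

# The penalty is subtracted from the reward, so drifting far from the
# reference model lowers the total reward the policy is trained on.
raw_reward = 2.5   # score from the reward model (illustrative)
kl_coef = 0.2      # strength of the KL penalty (illustrative)
adjusted_reward = raw_reward - kl_coef * kl_estimate

print(f"KL estimate: {kl_estimate.item():.3f}, "
      f"adjusted reward: {adjusted_reward.item():.3f}")
```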

6. Adjusting parameters

To avoid issues with KL divergence, we should set top_k to 0 to disable top-k sampling, since restricting sampling to the top k tokens distorts the distribution we sample from compared with the probabilities the model actually assigns. We should also set min_length to -1, which allows generation to end as soon as the model predicts the end-of-sequence token. Setting these parameters helps the model stay closer to the reference model and generate more accurate text. After the parameters are defined, they can be passed to the policy model when generating responses.
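As an illustration, this is roughly how these settings are collected and passed when sampling from the policy model with Hugging Face's TRL library. The library choice is an assumption (the lesson doesn't name one), and the sketch assumes that ppo_trainer, tokenizer, and query_tensors have already been set up earlier in the training script.

```python
# Generation settings that keep sampling close to the model's true distribution:
# top_k=0 disables top-k truncation, and min_length=-1 lets generation stop as
# soon as the end-of-sequence token is predicted.
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

# Sample responses from the policy model for a batch of query tensors
# (ppo_trainer and query_tensors are assumed to exist, as in TRL's PPO examples).
response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
```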

7. Checking the reward model

Another way to avoid KL divergence issues is to check the reward model. If the reward isn't improving over time, there might be a problem. This can be done by inspecting its output: a score, or reward, generated for each text output. Let's consider a dataset of social media comments that is being used to train a sentiment analysis model.
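One way to inspect these scores is to run the reward model over a handful of comments and look at the value it assigns to each one. The sketch below uses a Hugging Face sentiment-analysis pipeline as a stand-in reward model; the specific checkpoint and the example comments are illustrative choices, not part of the lesson.

```python
from transformers import pipeline

# A sentiment classifier used as a stand-in reward model
# (the checkpoint name is an illustrative choice).
reward_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

comments = [
    "I absolutely loved this, best day ever!",
    "This was the worst experience I have ever had.",
    "It was fine, nothing special.",
]

# Each output contains a label and a score we can treat as the reward signal.
for comment, result in zip(comments, reward_pipe(comments)):
    print(f"{result['label']:>8}  {result['score']:.3f}  {comment}")
```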

8. Checking the reward model

We should examine extreme cases to ensure it's working correctly: for example, we should verify that extremely positive and extremely negative examples receive clearly different rewards. We should also examine the distribution of our dataset and ensure it contains a balanced set of examples. Lastly, we should consider normalizing the rewards to prevent instability during training. Rewards can sometimes be too large or too small, leading to inconsistent updates to the model.
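As a small illustration of these last checks, the sketch below compares the rewards of an extreme positive and an extreme negative example, then standardizes the batch of rewards to zero mean and unit variance before an update. The reward values are made up for the example.

```python
import torch

# Raw reward scores for a batch of responses (made-up values, including one
# extremely positive and one extremely negative example).
rewards = torch.tensor([8.0, -6.5, 0.3, 1.1, -0.4])

# Sanity check: the extreme positive and negative examples should receive
# clearly different rewards.
print("max reward:", rewards.max().item(), " min reward:", rewards.min().item())

# Normalize to zero mean and unit variance so that overly large or small
# rewards don't cause inconsistent updates during training.
normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print("normalized rewards:", normalized)
```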

9. Let's practice!

Now that we've learned about reward metric adjustment, let's bring it all together and put it into practice.
