
Preparing data for RLHF

1. Preparing data for RLHF

Welcome back to this video, where we'll explore how to prepare data for RLHF.

2. Preference vs. Prompt datasets

The RLHF process requires two additional datasets aside from the dataset used to fine-tune the initial LLM: a prompt dataset and a preference dataset. Let's take a look at where each of these comes in.

3. Preference vs. Prompt datasets

The prompt dataset is used to extract prompts to feed into the models,

4. Preference vs. Prompt datasets

while the preference dataset is used to train the reward model. The preference dataset is where the human element comes in: it's in this dataset that the human preferences are expressed.

5. Prompt dataset

Let's examine the prompt dataset first. It is essentially a collection of inputs, or "prompts", that are fed into the model. We can think of it as a set of questions or scenarios the model needs to respond to. It serves as the starting point for the model to generate output. We can look for prompt datasets on Hugging Face to obtain one. This one, for example, can be loaded using load_dataset, and it already contains all the prompts as a column. Let's print one of them as an example. Depending on how the data is saved, we might need to extract the prompt from an existing conversation. In such cases, we need to look for specific markers in the dataset that signal the start and end of the prompt, such as 'input', 'text', or 'human'.
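As a rough sketch, loading a prompt dataset and printing one of its prompts could look like the following; the dataset name and column used here are illustrative stand-ins, not necessarily the one shown in the video.

```python
from datasets import load_dataset

# Illustrative prompt dataset from the Hugging Face Hub; swap in whichever
# prompt dataset you are working with, as long as it exposes the prompts as a column.
prompt_data = load_dataset("HuggingFaceH4/instruction-dataset", split="test")

# The prompts are already stored as a column, so printing one is direct
print(prompt_data["prompt"][0])
```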

6. Exploring the preference dataset

Let's now explore the preference dataset. This dataset helps train models to understand which output humans prefer from a set of choices. First, we'll load the preference dataset using the Hugging Face datasets library. Here, we're using a subset of the Anthropic RLHF dataset. Each data point contains a chosen output and a rejected output, indicating which completion was preferred. This is vital for training the reward model.
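A minimal loading sketch, assuming the Anthropic/hh-rlhf dataset on the Hugging Face Hub; the exact subset used in the video isn't specified here, so the slice below is only an example.

```python
from datasets import load_dataset

# Load a slice of the Anthropic RLHF preference data; adjust the split
# or subset to match the data you actually want to work with.
preference_data = load_dataset("Anthropic/hh-rlhf", split="train[:1000]")

# Each example pairs a preferred completion with a rejected one
print(preference_data[0]["chosen"])
print(preference_data[0]["rejected"])
```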

7. Processing the preference dataset

In this dataset, the prompt is the first element of the list of messages stored in each column. To extract the prompt, we use a function that selects the first dictionary in that list and returns its content as the prompt. We can apply this custom function to our dataset with the map function, passing a lambda that is applied to each element of the preference data to unpack the example dictionary and add a new key, 'prompt', to it. Notice that every dataset is different, and when using other datasets, we might have to adapt the way we extract our prompts based on how the data is saved.
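Here's a sketch of that extraction. It assumes each conversation is stored as a list of message dictionaries with a 'content' field, and it builds a tiny hand-made example (the texts are placeholders) rather than using the real dataset, since the raw Anthropic/hh-rlhf release stores plain strings; adapt the indexing to your dataset's actual schema.

```python
from datasets import Dataset

# Tiny hand-made preference example in a conversational (list-of-dicts) format;
# the texts below are placeholders, not taken from the real dataset.
conv_data = Dataset.from_dict({
    "chosen": [[
        {"role": "user", "content": "Which vitamins should I take daily?"},
        {"role": "assistant", "content": "A balanced multivitamin covers most needs."},
    ]],
    "rejected": [[
        {"role": "user", "content": "Which vitamins should I take daily?"},
        {"role": "assistant", "content": "Take as many as you can find."},
    ]],
})

def extract_prompt(conversation):
    # The first dictionary in the conversation list holds the user's prompt
    return conversation[0]["content"]

# Apply the custom function with map, adding a new 'prompt' key to each example
conv_data = conv_data.map(
    lambda example: {**example, "prompt": extract_prompt(example["chosen"])}
)
```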

8. Final preference dataset

Now, let's pick a sample from the dataset by selecting a range of 1. At the moment, the prompt, the question about vitamins, is still included in the chosen column. Depending on how we plan to use the dataset, we can choose to either remove or keep the prompt in the final dataset. This is how the preference dataset looks. We use this data to train the reward model, fine-tuning it based on the human preferences indicated in the dataset.
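Continuing with conv_data from the previous sketch, selecting and inspecting a single sample might look like this:

```python
# Select a single example (a range of length 1) to inspect the result
sample = conv_data.select(range(1))

print(sample[0]["prompt"])  # the extracted question
print(sample[0]["chosen"])  # the prompt still appears inside the chosen conversation

# Depending on how the reward-model training code expects its inputs,
# the prompt can be kept as its own column or stripped from the chosen/rejected texts.
```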

9. Let's practice!

In the next part of the course, we'll explore how to gather high-quality feedback before moving on to training the reward and policy models. But now, let's practice preprocessing the prompt and preference datasets!
