
Preprocessing data for fine-tuning

1. Preprocessing data for fine-tuning

Let's go over how to preprocess datasets for fine-tuning Llama models.

2. Using datasets for fine-tuning

When fine-tuning, the results depend largely on the quality of the data. Datasets are usually split into three parts: the training set, which contains most of the data and is used to train the model;

3. Using datasets for fine-tuning

the validation set, a separate set to adjust parameters and select the best version;

4. Using datasets for fine-tuning

and the test set, to objectively evaluate the model's performance.

5. Preparing data using the datasets library

The Datasets library by Hugging Face provides a comprehensive collection of datasets for LLM tasks and streamlines working with text data, offering functionality for loading, splitting, preprocessing, and efficient memory management.

6. Loading a customer service dataset

Let's prepare a customer service dataset. The load_dataset function retrieves the dataset from the Hugging Face Hub and can select one of its predefined subsets, such as 'train', via the split argument. These predefined splits usually correspond to the training, validation, and test sets. Once loaded, we can look at the dataset columns relevant to our task, such as 'instruction' and 'response'.
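As a minimal sketch of this step, assuming an illustrative dataset identifier (substitute the one used in your exercise):

```python
from datasets import load_dataset

# Load the 'train' split of a customer service dataset from the Hugging Face Hub.
# The dataset identifier below is an illustrative assumption.
dataset = load_dataset(
    "bitext/Bitext-customer-support-llm-chatbot-training-dataset",
    split="train",
)

# Inspect the available columns, e.g. 'instruction' and 'response'.
print(dataset.column_names)
```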

7. Peeking into the data

Let's inspect the first data point. Each data point is a dictionary mapping column names to values. We are interested in the 'instruction' and 'response' columns for this use case, as they contain the customer service question and the target response.
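Continuing from the loading sketch above, indexing the dataset returns a single example as a dictionary:

```python
# Each data point is a dictionary mapping column names to values.
sample = dataset[0]

print(sample["instruction"])  # the customer service question
print(sample["response"])     # the target response
```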

8. Filtering the dataset

To filter dataset samples, we import load_dataset and the Dataset class. The Dataset class stores datasets and provides utilities for converting data to a Dataset object. We check the data's shape with the shape attribute, then select the first thousand examples using Python's slicing syntax, which returns them as a dictionary of lists. We then convert this dictionary back into a Dataset object using from_dict.
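A minimal sketch of this step, again assuming the illustrative dataset identifier from earlier:

```python
from datasets import load_dataset, Dataset

dataset = load_dataset(
    "bitext/Bitext-customer-support-llm-chatbot-training-dataset",  # illustrative name
    split="train",
)

# The shape attribute returns (number of rows, number of columns).
print(dataset.shape)

# Slicing returns the first thousand examples as a dictionary of lists.
subset = dataset[:1000]

# Convert the dictionary back into a Dataset object.
small_dataset = Dataset.from_dict(subset)
print(small_dataset.shape)
```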

9. Preprocessing the dataset

To prepare the dataset for fine-tuning, we combine the 'instruction' and 'response' fields into a single 'conversation' field. Using a custom function, we format each example as a labeled query and response pair. Then, we apply this function to the dataset using the 'map' method. This ensures the model sees a coherent input-output structure during training.
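A sketch of that preprocessing step, continuing from the filtering example; the function name and the exact label text ("Query:" / "Response:") are assumptions for illustration:

```python
def to_conversation(example):
    # Combine 'instruction' and 'response' into a single labeled 'conversation' field.
    example["conversation"] = (
        f"Query: {example['instruction']}\n"
        f"Response: {example['response']}"
    )
    return example

# Apply the formatting function to every example in the dataset.
processed_dataset = small_dataset.map(to_conversation)
print(processed_dataset[0]["conversation"])
```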

10. Saving the preprocessed dataset

After preprocessing, we save the modified dataset to disk using the 'save_to_disk' method. Saving the dataset locally allows us to reuse it across multiple experiments, including fine-tuning with TorchTune, and makes data handling more efficient when working with large datasets. To retrieve the dataset later, we use the 'load_from_disk' function from Hugging Face's datasets library, passing the path we originally saved it to.
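Continuing the sketch, with an arbitrary example path:

```python
from datasets import load_from_disk

# Save the preprocessed dataset locally; the path is an arbitrary example.
processed_dataset.save_to_disk("customer_service_dataset")

# Later (e.g. before a fine-tuning run), reload it from the same path.
reloaded_dataset = load_from_disk("customer_service_dataset")
print(reloaded_dataset)
```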

11. Using Hugging Face datasets with TorchTune

One of TorchTune's features is that it can read Hugging Face datasets directly. After preparing our dataset, we can fine-tune a model with TorchTune from the command line by specifying a "dataset" path and other configuration options, such as the split to work with, usually 'train' or 'test'. In this example, TorchTune reads the Hugging Face dataset and applies the chosen recipe and configuration.
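A rough sketch of what such a launch can look like; the recipe name, config name, and dataset override keys are examples only and may differ across TorchTune versions (an additional column mapping may also be required for a custom dataset):

```bash
# Example only: names and override keys are assumptions, not a verified command.
tune run lora_finetune_single_device \
  --config llama3_2/1B_lora_single_device \
  dataset._component_=torchtune.datasets.instruct_dataset \
  dataset.source=bitext/Bitext-customer-support-llm-chatbot-training-dataset \
  dataset.split=train
```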

12. Let's practice!

Let's preprocess some data!
