1. Preprocess text for training
Text preprocessing is crucial for distributed AI training, enabling large datasets to be processed in parallel across devices.
2. Text transformation: preparing data for model mastery
For our accessibility application, we'd like to provide concise document summaries for people with low vision. We'll need to train a model to understand paraphrases: pairs of sentences that share the same meaning. The MRPC dataset consists of sentence pairs and a binary label indicating whether the two sentences are paraphrases.
3. Dataset structure
Generally, text datasets are nested dictionaries, but the specific keys depend on the dataset. We'll need to explore a dataset to see its structure, or refer to documentation such as Hugging Face dataset cards. Printing the dataset shows that MRPC is a nested dictionary with train, validation, and test splits, and each split lists its features.
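Here's a minimal sketch of loading and inspecting the dataset, assuming MRPC is pulled from the GLUE benchmark with the Hugging Face datasets library:

```python
from datasets import load_dataset

# Load the MRPC sentence-pair dataset from the GLUE benchmark
dataset = load_dataset("glue", "mrpc")

# Printing shows a nested dictionary (DatasetDict) with train, validation,
# and test splits, plus the features available in each split
print(dataset)
```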
4. Manipulating the text dataset
We can access data in the train split using dictionary notation. Within the train split, we can extract specific features such as "sentence1," "sentence2," and "label" following the same notation. As an example, we can get a list of all sentence1 examples in the train split.
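A sketch of this dictionary-style access, building on the dataset object loaded above:

```python
# Access the train split using dictionary notation
train_split = dataset["train"]

# Extract individual features from the split
all_sentence1 = train_split["sentence1"]   # list of all sentence1 examples
all_labels = train_split["label"]          # 1 = paraphrase, 0 = not a paraphrase

print(all_sentence1[0])
```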
We need to preprocess the text dataset with a tokenizer so that a model can read it. We can load a pretrained tokenizer with AutoTokenizer to match our Transformer model. This tokenizer converts each sentence pair into a sequence of token IDs.
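A sketch of loading the tokenizer; the checkpoint name "bert-base-uncased" is an assumption for illustration and should match the model being trained:

```python
from transformers import AutoTokenizer

# Load the pretrained tokenizer that matches the Transformer model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```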
5. Define an encoding function
Next, let's define a function to encode examples from our dataset. The function extracts sentence1 and sentence2 from a training example and passes them to the tokenizer. The truncation argument tells the tokenizer to cut off inputs longer than the maximum length of 512 tokens, while padding fills shorter sequences with the pad token so that all inputs have the same length.
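A sketch of such an encoding function, assuming the tokenizer loaded above:

```python
def encode(example):
    # Tokenize the sentence pair: truncate inputs longer than 512 tokens
    # and pad shorter ones so every input has the same length
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
```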
6. Format column names
We'll apply the encode function to each example in the train split using the map function.
After encoding, we'll specify which columns to pass to the model. We also need to rename the "label" column to "labels," since this particular model expects a column called "labels." More generally, we can find which columns a model expects by referring to its model card and documentation on Hugging Face.
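Putting these steps together, a sketch of mapping, renaming, and formatting; the exact column list depends on the model, and input_ids, attention_mask, and labels are typical:

```python
# Tokenize every example in the train split
train_dataset = dataset["train"].map(encode)

# Rename "label" to "labels", which this model expects
train_dataset = train_dataset.rename_column("label", "labels")

# Keep only the columns the model needs, as PyTorch tensors
train_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"]
)
```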
7. Saving and loading checkpoints
After formatting, we wrap the dataset in a PyTorch DataLoader and pass it to accelerator.prepare(), which places the data on the available GPUs for parallel processing; prepare() works with any PyTorch dataset (of type torch.utils.data.Dataset) once it is wrapped in a DataLoader.
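A sketch, assuming the formatted train_dataset from above and an illustrative batch size of 16:

```python
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

# Wrap the formatted dataset in a DataLoader, then let Accelerate place
# the batches on the available GPU(s)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
train_dataloader = accelerator.prepare(train_dataloader)
```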
Preprocessing text can be time-consuming, and an interruption or system failure forces the entire process to restart, wasting resources. To address this, we can save a checkpoint: the state of the preprocessed text, consisting of the tokenized inputs and the attention masks that indicate which tokens are relevant. We call accelerator.save_state(), which saves this data in a specified directory. Later, we can load the checkpoint with accelerator.load_state() to resume training.
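A sketch of checkpointing with Accelerate; the directory name is illustrative:

```python
# Save the current state to a checkpoint directory
accelerator.save_state("preprocess_checkpoint")

# Later, resume from the saved checkpoint
accelerator.load_state("preprocess_checkpoint")
```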
8. Let's practice!
The floor is yours to practice preprocessing text for training!