Preparing for fine-tuning
1. Preparing for fine-tuning
Welcome back. So far, we have used the pipeline() interface.
2. Pipelines and auto classes
It streamlines language tasks by automatically selecting a model and tokenizer, but it offers limited control. Auto classes allow more customization, enabling manual adjustments and model fine-tuning, which we'll cover next.
3. LLM lifecycle
The LLM development lifecycle is similar to that of other machine and deep learning models. However, LLM training has two phases: pre-training on a broad dataset to learn general language patterns, followed by fine-tuning on domain-specific data to adapt it to specialized tasks. For instance, an insurance company's data could be used to fine-tune an LLM to handle insurance customer inquiries.
5. Loading a dataset for fine-tuning
To begin fine-tuning, we prepare the dataset to ensure the model learns the right patterns. Hugging Face offers the datasets library, which provides access to a vast collection of datasets on the Hub that is great for experimentation. Our goal is to fine-tune a pre-trained model with the imdb data. Here, we load a movie review dataset using the load_dataset function, setting the split parameter to extract the training and test data. We use .shard() to split this large dataset into four chunks and select the first chunk by setting index to zero. We've done this to speed up training; the extraction process will differ depending on the data and computational needs.
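As a minimal sketch of this loading and sharding step (the imdb identifier and four shards match the description above; treat the shard count as an adjustable choice):

```python
from datasets import load_dataset

# Load the training and test splits of the IMDb movie review dataset
train_data = load_dataset("imdb", split="train")
test_data = load_dataset("imdb", split="test")

# Split each split into four chunks and keep only the first one (index=0)
# to speed up training; adjust num_shards to your data and compute budget
train_data = train_data.shard(num_shards=4, index=0)
test_data = test_data.shard(num_shards=4, index=0)
```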
6. Auto classes
Recall the two common auto classes in Hugging Face's transformers library: AutoModel and AutoTokenizer. There are also auto classes for loading task-specific models, such as AutoModelForSequenceClassification, which is suitable for loading sentiment classification models. Each class's from_pretrained method loads a specified pre-trained model with its learned weights, or a tokenizer suited to that model.
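A sketch of this step; the distilbert-base-uncased checkpoint and num_labels=2 are illustrative choices, not the only option:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; any Hub model suited to sequence classification works similarly
model_name = "distilbert-base-uncased"

# Load the pre-trained model with its learned weights, configured for two labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
```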
7. Tokenization
After loading the data and instantiating the model and tokenizer, we tokenize the data subset in one go by selecting the text column, enabling padding, and truncating sequences that exceed the specified maximum length. This helps with efficiency. We set return_tensors to "pt" to return PyTorch tensors, since our model expects this format.
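A sketch of tokenizing the whole text column at once, assuming the train_data and tokenizer objects from the previous steps; the max_length of 64 is an illustrative value:

```python
# Tokenize every review in the sharded training split in a single call
tokenized_training_data = tokenizer(
    train_data["text"],
    return_tensors="pt",   # return PyTorch tensors
    padding=True,          # pad shorter sequences to a common length
    truncation=True,       # cut off sequences longer than max_length
    max_length=64,         # illustrative maximum sequence length
)

# Printing shows the truncated token IDs (and attention masks)
print(tokenized_training_data)
```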
8. Tokenization output
Printing the output shows us the truncated list of token IDs. The output has been shortened here for brevity.
9. Tokenizing row by row
If more control is needed, we can tokenize a dataset in batches or row by row with a custom function and the .map() method, setting batched to True or False, respectively. The result is a new dataset object with additional columns for the tokenized data, which the training loop requires. Note that the .map() method works on dataset objects, not plain Python lists.
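A sketch of the .map() approach with a custom tokenization function; the helper name tokenize_function and the padding strategy are illustrative:

```python
# Custom function applied to each batch (batched=True) or each row (batched=False)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=64,
    )

# Tokenize in batches of rows
tokenized_in_batches = train_data.map(tokenize_function, batched=True)

# Tokenize one row at a time
tokenized_by_row = train_data.map(tokenize_function, batched=False)
```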
10. Subword tokenization
The tokenization we've performed is known as subword tokenization, which is common in most modern tokenizers. Here, words are split into smaller, meaningful sub-parts, including prefixes and suffixes. For example, with subword tokenization, a word like "unbelievably" would be split into the tokens "un", "believ", and "ably".
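To see subword tokenization in action, we can call the tokenizer's tokenize method directly; the exact pieces depend on the tokenizer's learned vocabulary, so they may differ from the split above:

```python
# Inspect how the loaded tokenizer splits a single word into subword pieces
print(tokenizer.tokenize("unbelievably"))
```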
12. Let's practice!
Let's practice!