Preparing for fine-tuning
1. Preparing for fine-tuning
Welcome back. So far, we have used the pipeline() interface.
2. Pipelines and auto classes
It streamlines language tasks by automatically selecting a model and tokenizer, but it offers limited control. Auto classes allow more customization, enabling manual adjustments and model fine-tuning, which we'll cover next.
3. LLM lifecycle
The LLM development lifecycle is similar to that of other machine and deep learning models. However, LLM training has two phases: pre-training on a broad dataset to learn general language patterns, followed by fine-tuning on domain-specific data to adapt it to specialized tasks. For instance, an insurance company's data could be used to fine-tune an LLM to handle insurance customer inquiries.
5. Loading a dataset for fine-tuning
To begin fine-tuning, we prepare the dataset to ensure the model learns the right patterns. Hugging Face offers the datasets library, which provides access to a vast collection of datasets on the Hub that is great for experimentation. Our goal is to fine-tune a pre-trained model with the imdb data. Here, we load a movie review dataset using the load_dataset function, setting the split parameter to extract the training and test data. We use .shard() to split this large dataset into four chunks and select the first chunk by setting index to zero. We've done this to speed up training; the extraction process will differ depending on the data and computational needs.
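As a minimal sketch of this loading and sharding step (the imdb identifier and four shards match the description above; treat the shard count as an adjustable choice):

```python
from datasets import load_dataset

# Load the training and test splits of the IMDb movie review dataset
train_data = load_dataset("imdb", split="train")
test_data = load_dataset("imdb", split="test")

# Split each split into four chunks and keep only the first one (index=0)
# to speed up training; adjust num_shards to your data and compute budget
train_data = train_data.shard(num_shards=4, index=0)
test_data = test_data.shard(num_shards=4, index=0)
```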
6. Auto classes
Recall the two common auto classes in Hugging Face's transformers library: AutoModel and AutoTokenizer. There are also auto classes for loading task-specific models, such as AutoModelForSequenceClassification, which is suitable for loading sentiment classification models. Each class's from_pretrained method loads a specified pre-trained model with its learned weights, or a tokenizer suited to that model.
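A sketch of this step; the distilbert-base-uncased checkpoint and num_labels=2 are illustrative choices, not the only option:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; any Hub model suited to sequence classification works similarly
model_name = "distilbert-base-uncased"

# Load the pre-trained model with its learned weights, configured for two labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
```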
7. Tokenization
After loading the data and instantiating the model and tokenizer, we tokenize the data subset in one go by selecting the text column, enabling padding, and truncating sequences that exceed the specified maximum length. This helps with efficiency. We set return_tensors to "pt" to return PyTorch tensors, since our model expects this format.
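A sketch of tokenizing the whole text column at once, assuming the train_data and tokenizer objects from the previous steps; the max_length of 64 is an illustrative value:

```python
# Tokenize every review in the sharded training split in a single call
tokenized_training_data = tokenizer(
    train_data["text"],
    return_tensors="pt",   # return PyTorch tensors
    padding=True,          # pad shorter sequences to a common length
    truncation=True,       # cut off sequences longer than max_length
    max_length=64,         # illustrative maximum sequence length
)

# Printing shows the truncated token IDs (and attention masks)
print(tokenized_training_data)
```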
8. Tokenization output
Printing the output shows us the truncated list of token IDs. The output has been shortened here for brevity.
9. Tokenizing row by row
If more control is needed, we can tokenize a dataset in batches or row by row with a custom function and the .map() method, setting batched to True or False, respectively. The result is a new dataset object with additional columns for the tokenized data, which the training loop requires. Note that the .map() method works on dataset objects, not plain Python lists.
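A sketch of the .map() approach with a custom tokenization function; the helper name tokenize_function and the padding strategy are illustrative:

```python
# Custom function applied to each batch (batched=True) or each row (batched=False)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=64,
    )

# Tokenize in batches of rows
tokenized_in_batches = train_data.map(tokenize_function, batched=True)

# Tokenize one row at a time
tokenized_by_row = train_data.map(tokenize_function, batched=False)
```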
10. Subword tokenization
The tokenization we've performed is known as subword tokenization, which is common in most modern tokenizers. Here, words are split into smaller, meaningful sub-parts, including prefixes and suffixes. For example, with subword tokenization, a word like "unbelievably" would be split into the tokens "un", "believ", and "ably".
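To see subword tokenization in action, we can call the tokenizer's tokenize method directly; the exact pieces depend on the tokenizer's learned vocabulary, so they may differ from the split above:

```python
# Inspect how the loaded tokenizer splits a single word into subword pieces
print(tokenizer.tokenize("unbelievably"))
```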
12. Let's practice!
Let's practice!