Model fine-tuning with Hugging Face
1. Model fine-tuning with Hugging Face
In this video, we are fine-tuning a Llama model for customer service and evaluating its output.
2. What do we need to conduct fine-tuning?
Let's go back to our fine-tuning workflow. To fine-tune Llama with Hugging Face, we'll need five essential components. First, the Llama model and tokenizer, which we'll use as our baseline. We'll then need a training dataset, like the Bitext customer service dataset, to provide the model with domain-specific examples for learning. We'll also define training arguments, which configure how the model learns, such as hyperparameters. We'll use the SFTTrainer class from the TRL library to conduct the fine-tuning process. Finally, we'll use ROUGE-1 to assess the model's performance and ensure our results meet expectations.
3. How to load models and tokenizers with Auto classes
First, we load the model and tokenizer from the Hugging Face Hub using the Auto classes, AutoModelForCausalLM and AutoTokenizer. We set the pad_token to the end-of-sequence (EOS) token, the token that marks where generation stops, since the Llama tokenizer doesn't define a padding token by default.
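As a minimal sketch, the loading step could look like the following; the checkpoint ID is a placeholder, not necessarily the one used in the course:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the Llama model you have access to.
model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Llama tokenizers ship without a padding token, so we reuse the EOS token.
tokenizer.pad_token = tokenizer.eos_token
```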
4. Defining training parameters with TrainingArguments
Let's configure training arguments using Hugging Face's TrainingArguments helper class. Key parameters include the batch size per GPU (per_device_train_batch_size), which controls how many samples the model predicts on before updating its weights based on the error of those predictions, and learning_rate, which sets the size of those weight updates. max_grad_norm clips gradient values during training to mitigate extreme outlier losses, smoothing learning. Other training arguments can also be configured.
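A sketch with illustrative values; the specific numbers below are assumptions to tune for your own hardware and dataset, not the course's settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-customer-service",  # where checkpoints are saved
    per_device_train_batch_size=4,          # samples per GPU before each weight update
    learning_rate=2e-5,                     # size of each weight update
    max_grad_norm=1.0,                      # clip gradients to smooth learning
    num_train_epochs=1,                     # passes over the training set
)
```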
5. How to set up training with SFTTrainer
To use SFTTrainer, we provide the model, tokenizer, and training dataset, along with the name of the column containing the text examples, max_seq_length for clipping overly long sequences, and the training arguments.
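A sketch of the setup, assuming train_dataset has already been loaded and its text lives in a column named "text"; note that keyword names vary across TRL versions (newer releases move dataset_text_field and max_seq_length into SFTConfig):

```python
from trl import SFTTrainer

# Keyword names follow older TRL releases; newer versions expect
# dataset_text_field and max_seq_length inside an SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,    # e.g. the Bitext customer service examples
    dataset_text_field="text",      # column holding the text examples (assumed name)
    max_seq_length=512,             # clip sequences longer than this
    args=training_args,
)
```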
6. Understanding fine-tuning results with SFTTrainer
We can then call trainer.train() to start training. Metrics available for monitoring include the number of training steps, which describes how many times we updated the weights, and the training loss. We also get summary metrics: the total training time, followed by the number of samples and steps processed per second, the total number of floating-point operations, the final training loss, and the epoch.
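Continuing the sketch, train() returns an object whose metrics dictionary holds these summary statistics:

```python
# Kick off fine-tuning; the returned TrainOutput carries the summary metrics
# (runtime, samples and steps per second, total FLOs, training loss, epoch).
result = trainer.train()
print(result.metrics)
```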
7. How to evaluate a trained model using ROUGE-1
We evaluate our fine-tuned model using the ROUGE-1 metric, which measures word overlap between generated and ground-truth sentences. To use ROUGE-1, we load the metric with Hugging Face's evaluate library, store predictions and references, then compute evaluation scores. Here, the first example matches its reference fully, so it scores one; the second has no matching words and scores zero. Thus, the average score is 0.5.
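A minimal worked example; the sentences are made up to reproduce the full-match/no-match scenario described above:

```python
import evaluate

rouge = evaluate.load("rouge")

# One exact match, one prediction with no overlapping words.
predictions = ["your order has been refunded", "hello there"]
references = ["your order has been refunded", "goodbye friend"]

results = rouge.compute(predictions=predictions, references=references)
print(results["rouge1"])  # (1.0 + 0.0) / 2 = 0.5
```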
8. How to use the ROUGE-1 score
We'll use an evaluation set built from the last 500 elements of the Bitext dataset, stored in evaluation_dataset. This function tokenizes the instruction text through tokenizer.encode, which converts the text into the list of token IDs the model actually takes as input, generates outputs from those tokens with the model, and decodes them back into natural text. We follow these steps by returning the reference and generated answers.
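The function itself is shown on the slide rather than in the transcript; here is a sketch of what it might look like, with assumed column names (instruction, response) and a hypothetical helper name:

```python
def generate_answer(example):
    # Hypothetical helper mirroring the steps described above.
    # Convert the instruction text into the token IDs the model consumes.
    input_ids = tokenizer.encode(example["instruction"], return_tensors="pt")
    # Generate output tokens from those inputs.
    output_ids = model.generate(input_ids, max_new_tokens=100)
    # Decode the token IDs back into natural text.
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Return the ground-truth answer and the generated one.
    return example["response"], generated
```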
9. How to run ROUGE-1 on an evaluation set
Now we can produce the evaluation set using our preparation function. We initialize our evaluator and run its compute method with our predictions and reference answers. Then, we print the results.
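Put together, and assuming the names from the earlier sketches, the evaluation loop could look like this:

```python
import evaluate

references, predictions = [], []
for example in evaluation_dataset:
    reference, generated = generate_answer(example)
    references.append(reference)
    predictions.append(generated)

rouge = evaluate.load("rouge")
results = rouge.compute(predictions=predictions, references=references)
print(results)
```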
10. Fine-tuning vs. no fine-tuning
Let's compare these results with those from running the same evaluation on the original model, shown on the right. We've improved performance by almost 50 percent!
11. Let's practice!
Time to practice with Llama!