Efficient fine-tuning with LoRA

1. Efficient fine-tuning with LoRA

In this video, we explore efficient fine-tuning with Low-Rank Adaptation (LoRA), which lets us fine-tune larger models on limited hardware.

2. What happens when we train a model?

Training a model involves a few steps: creating a vector from the input data (tokens), multiplying it by a weight matrix (the model), and comparing the output to a reference vector from the training data. The resulting error is then used to update the weights. Larger models require more computation at each step, so reducing the effective model size makes training more efficient.
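As a rough illustration of this loop, here is a minimal sketch assuming PyTorch, a toy single-matrix "model", and made-up tensor sizes:

```python
import torch

# Toy setup: a single weight matrix stands in for "the model"
W = torch.randn(16, 8, requires_grad=True)  # weight matrix (the model)
x = torch.randn(8)                          # vector created from input tokens
y_ref = torch.randn(16)                     # reference vector from the training data

y_pred = W @ x                                      # multiply the input by the weights
loss = torch.nn.functional.mse_loss(y_pred, y_ref)  # compare output to the reference
loss.backward()                                     # compute the error's gradient
with torch.no_grad():
    W -= 0.01 * W.grad                              # update the weights using the error
```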

3. What is LoRA

Low-Rank Adaptation represents the updates to a pre-trained model's weight matrices as products of much smaller, low-rank matrices, reducing the number of trainable parameters while maintaining performance. These smaller matrices are trained instead of the large original matrix representing the model, which is more efficient since far less computation is required. Reducing the number of parameters this way also acts as regularization, as less granular information can be stored in fewer parameters. The method applies to a wide range of tasks and architectures.
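To get a feel for the savings, consider a hypothetical 4096-by-4096 weight matrix and a rank of 8; both numbers are illustrative, not taken from the video:

```python
d = 4096  # hidden dimension of a hypothetical weight matrix
r = 8     # LoRA rank

full_params = d * d          # 16,777,216 parameters in the original matrix
lora_params = d * r + r * d  # 65,536 parameters across the two low-rank matrices

# The effective weight becomes W + B @ A, with A of shape (r, d) and B of shape (d, r);
# only A and B are trained while W stays frozen.
print(f"{full_params / lora_params:.0f}x fewer trainable parameters")  # 256x
```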

4. How to implement LoRA using PEFT

We can set up LoRA fine-tuning using the Parameter-Efficient Fine-Tuning (PEFT) library, with LoraConfig holding our LoRA parameters. The r parameter determines the rank, or size, of the low-rank matrices: a lower value reduces memory usage, while a higher value approximates the original model more closely. The lora_alpha, or scaling factor, determines how much weight the updates produced by the low-rank matrices are given when they are applied to the original model matrix. lora_dropout controls the dropout probability within the LoRA layers. bias sets whether to train bias parameters: none at all, all of them, or only those belonging to the LoRA layers. The final two parameters are the task type, which for Llama is causal language modeling (CAUSAL_LM), and the target modules, the weight matrices of the LLM we want to apply LoRA to.
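A configuration along these lines might look as follows; the specific values and the target module names (q_proj, v_proj) are illustrative rather than taken from the video:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                    # rank of the low-rank matrices
    lora_alpha=32,          # scaling factor for the low-rank updates
    lora_dropout=0.05,      # dropout probability within the LoRA layers
    bias="none",            # train no biases ("all" and "lora_only" are the alternatives)
    task_type="CAUSAL_LM",  # task type for Llama-style models
    target_modules=["q_proj", "v_proj"],  # attention matrices to apply LoRA to
)
```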

5. Integrating LoRA configuration in training

To incorporate the LoRA configuration during training, we make a slight adjustment to our training pipeline from the previous video. When setting up the SFTTrainer class, we pass the model, training dataset, maximum sequence length, and dataset text field, along with the training arguments and the model tokenizer. Additionally, we pass the LoRA configuration we defined on the previous slide, lora_config, under peft_config. We can then initiate training with the train function.
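Put together, the call might look like the sketch below; model, train_dataset, training_args, and tokenizer are assumed to come from the earlier pipeline, and in newer TRL releases max_seq_length and dataset_text_field move into SFTConfig rather than the trainer itself:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,                  # the pre-trained model to fine-tune
    train_dataset=train_dataset,  # the training dataset
    max_seq_length=512,           # maximum sequence length (illustrative value)
    dataset_text_field="text",    # dataset column containing the training text
    args=training_args,           # training arguments from the previous video
    tokenizer=tokenizer,          # the model tokenizer
    peft_config=lora_config,      # the LoRA configuration defined above
)

trainer.train()  # initiate training
```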

6. LoRA vs regular finetuning

Without LoRA, on an NVIDIA A100 GPU with 40 GB of memory, the largest model we could train was TinyLlama, with 1.1 billion parameters, on 11,000 samples in 30 minutes. Any larger Llama variant would run out of memory and error out. Using LoRA, we were able to fine-tune a variant of Llama 3 pre-trained for chat question answering, with 8 times more parameters than TinyLlama, using the same dataset and hardware. It trained in about the same time, too!

7. Let's practice!

Let's practice using LoRA.
