1. Efficient fine-tuning in RLHF
Fine-tuning large language models in the context of RLHF can be memory-hungry. Let's have a look at techniques that address this challenge by reducing computational costs while maintaining model performance.
2. Parameter-efficient fine-tuning
Parameter-efficient fine-tuning, or PEFT, makes updating large AI models more efficient: instead of adjusting all model parameters,
3. Parameter-efficient fine-tuning
it adjusts only a small number of them.
One popular method is called LoRA, which stands for Low-Rank Adaptation.
LoRA works by freezing the original model weights and training only a few newly added layers that are much smaller. This makes fine-tuning faster and uses less computing power while still delivering performance comparable to fully fine-tuning the whole model.
Coupled with LoRA, we can use quantization to reduce memory and computational costs. With quantization we use lower-precision data types for weights and activations. For example, using 8-bit integers instead of the standard 32 bits reduces the amount of memory required: an 8-bit model can fit four times more data into the same amount of memory compared to a 32-bit one. This allows loading larger models into memory and speeds up inference.
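As a rough illustration of the savings, consider a hypothetical 7-billion-parameter model; the parameter count here is only an assumption for the arithmetic.

num_params = 7_000_000_000                  # hypothetical model size
mem_fp32 = num_params * 4 / 1024**3         # 32-bit floats: 4 bytes per parameter
mem_int8 = num_params * 1 / 1024**3         # 8-bit integers: 1 byte per parameter
print(f"32-bit: {mem_fp32:.1f} GiB")        # roughly 26 GiB
print(f" 8-bit: {mem_int8:.1f} GiB")        # roughly 6.5 GiB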
4. Step 1: load your active model in 8-bit precision
The peft library from Hugging Face has some helpful functions that cover these techniques.
To prepare the pre-trained model for loading in 8-bit precision, we can use the load_in_8bit=True flag with the from_pretrained method.
So, let's start by loading the pre-trained model in 8-bit.
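As a minimal sketch, assuming the trl value-head model class used for PPO and a placeholder model name such as "gpt2", this step could look like the following.

from trl import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2",                # placeholder model name
    load_in_8bit=True,     # store weights as 8-bit integers to cut memory use
    device_map="auto",     # spread layers across available devices
)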
5. Step 2: add extra trainable adapters using peft
To fine-tune a model using LoRA, we need to instantiate a base model, such as the PPO model we are training.
We then create a configuration, LoraConfig, where specific parameters are defined.
We wrap the base model with get_peft_model() to get a trainable PeftModel.
In the configuration, we define all the key settings.
First, 'r' is the rank of the low-rank matrices, the matrices with fewer parameters used in LoRA. 'r' controls their size: a larger 'r' increases the number of adjustable parameters and allows the model to capture more information.
Next is lora_alpha, a scaling factor for LoRA updates. A higher alpha makes the updates stronger, meaning the model's parameter changes are more pronounced. This can improve training but also increase the chance of overfitting.
lora_dropout is the dropout rate for the LoRA layers: a value of 0.1, for instance, randomly turns off 10% of units during training to help prevent overfitting.
Finally, the bias setting controls which bias terms are trained; setting it to "none" keeps all biases frozen, so only the LoRA parameters are updated.
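Putting these settings together, a minimal sketch could look like this; the specific values for r, lora_alpha, and lora_dropout are illustrative, not prescribed.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,               # rank of the low-rank matrices
    lora_alpha=32,      # scaling factor for the LoRA updates
    lora_dropout=0.1,   # dropout rate for the LoRA layers
    bias="none",        # keep bias parameters frozen
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)   # wrap the base model as a trainable PeftModel
model.print_trainable_parameters()           # confirm only a small fraction is trainable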
6. Step 3: use one model for reference and active logits
Using these configurations, we can initialize the PPOTrainer, which fine-tunes our PPO model. By setting 'ref_model' to 'None', the original pre-trained model is used as the frozen reference during training, avoiding the need to keep a separate reference model in memory. The tokenizer preprocesses the input text, the dataset provides the training examples, a data collator organizes batches for training, and the optimizer updates the model's parameters. This setup enables efficient fine-tuning while minimizing memory usage.
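A minimal sketch of this setup, assuming a trl version whose PPOTrainer accepts ref_model directly, and with the tokenizer, dataset, collator, and optimizer defined elsewhere, might look as follows.

from trl import PPOConfig, PPOTrainer

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=16),   # placeholder training configuration
    model=model,                       # the 8-bit, LoRA-wrapped active model
    ref_model=None,                    # reuse the frozen base model as the reference
    tokenizer=tokenizer,
    dataset=dataset,
    data_collator=collator,
    optimizer=optimizer,
)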
7. Let's practice!
Now, let's put this into practice and use RLHF in a memory-efficient way.