
Memory-efficient training with Adafactor

1. Memory-efficient training with Adafactor

Now that we've seen AdamW as a common optimizer,

2. Optimizers for training efficiency

let's compare it with Adafactor.

3. Optimizer tradeoffs

Adafactor saves memory by storing less optimizer state; the tradeoff is that some models may not perform as well. Training involves some experimentation to see which optimizers work well with which models.

4. How does Adafactor work?

Adafactor also begins with a forward pass

5. How does Adafactor work?

and gradient computation.

6. How does Adafactor work?

From the gradients, Adafactor estimates the exponential moving average (or EMA) of the squared gradients, also called the second moment. Unlike AdamW, Adafactor doesn't compute the EMA of the gradients themselves (the first moment), so it keeps less information from past steps.

7. How does Adafactor work?

Adafactor updates models using the gradient and second moment.
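
To make that flow concrete, here is a heavily simplified sketch of a single update step that uses only the gradient and a running estimate of the squared gradients. It deliberately ignores Adafactor's factorization and relative step sizing, and all tensor names and hyperparameter values are illustrative, not the library's implementation.

```python
import torch

# Illustrative tensors; in practice these come from the model and its backward pass
param = torch.randn(4, 3)
grad = torch.randn(4, 3)
exp_avg_sq = torch.zeros_like(param)  # running EMA of the squared gradients (second moment)

beta2, eps, lr = 0.999, 1e-30, 1e-3   # illustrative hyperparameters

# Update the EMA of the squared gradients from the current gradient
exp_avg_sq.mul_(beta2).add_(grad.pow(2), alpha=1 - beta2)

# Scale the gradient by the root of the second moment and step the parameter
update = grad / (exp_avg_sq.sqrt() + eps)
param.add_(update, alpha=-lr)
```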

8. How does Adafactor save memory?

During the training process, Adafactor saves memory by not storing the full second moment matrix;

9. How does Adafactor save memory?

instead it stores the column sum

10. How does Adafactor save memory?

and row sum of the matrix. The column or row sum is the sum of all elements in each column or row of the matrix, respectively.

11. How does Adafactor save memory?

It estimates the full matrix from the outer product of the row and column sums, normalized by the total sum. We won't derive the mathematics here, but the main takeaway is that Adafactor saves memory this way.
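
The savings are easy to see numerically. The sketch below builds a small matrix standing in for the squared-gradient EMA, keeps only its row and column sums, and reconstructs an approximation from their outer product divided by the total sum; the variable names are illustrative.

```python
import torch

# A small matrix standing in for the EMA of squared gradients (second moment)
v = torch.rand(4, 3)

row_sum = v.sum(dim=1)  # 4 values stored instead of the full 4 x 3 matrix
col_sum = v.sum(dim=0)  # 3 values

# Rank-1 reconstruction: outer product of the sums, normalized by the total sum
v_approx = torch.outer(row_sum, col_sum) / v.sum()

print(v.numel(), row_sum.numel() + col_sum.numel())  # 12 stored values vs. 7
print((v - v_approx).abs().max())                    # approximation error
```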

12. Trainer and Accelerator implementation

Next we train with Adafactor and analyze its memory usage afterwards. Trainer provides a simplified interface with no explicit training loop, while Accelerator lets us customize the training loop ourselves.

13. Implement Adafactor with Trainer

Throughout these examples, we assume we've loaded training objects like the model and dataset. In TrainingArguments, evaluation_strategy can be "epoch," or "steps" for more frequent evaluation. We set the optim argument to adafactor. Then we pass TrainingArguments to Trainer. Training proceeds as usual, and we can see evaluation metrics with each epoch. Ideally, accuracy and F1 score increase as we train.
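
Here is a minimal sketch of that setup, assuming model, train_dataset, eval_dataset, and a compute_metrics function (reporting accuracy and F1) were created earlier; the output directory and epoch count are illustrative. Note that recent transformers releases rename evaluation_strategy to eval_strategy.

```python
from transformers import Trainer, TrainingArguments

# Assumes model, train_dataset, eval_dataset, and compute_metrics already exist
training_args = TrainingArguments(
    output_dir="./results",        # illustrative path
    evaluation_strategy="epoch",   # or "steps" for more frequent evaluation
    num_train_epochs=3,            # illustrative value
    optim="adafactor",             # switch the optimizer from AdamW to Adafactor
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # e.g. accuracy and F1
)

trainer.train()  # evaluation metrics are printed at the end of each epoch
```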

14. Implement Adafactor with Accelerator

To use Accelerator, we import Adafactor, assuming we meet a minimum version of PyTorch, and define the optimizer. Then our training loop proceeds as before, and we can monitor loss during training.
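
A sketch of the Accelerator version follows. The import mentioned here (subject to a minimum PyTorch version) may refer to PyTorch's own torch.optim.Adafactor in recent releases; this sketch uses the Adafactor class from transformers instead, so the optimizer state names match the next step. The dataloader name, learning rate, and loop structure are assumptions.

```python
from accelerate import Accelerator
from transformers import Adafactor

# Assumes model and train_dataloader already exist
accelerator = Accelerator()
optimizer = Adafactor(model.parameters(), lr=1e-3,
                      scale_parameter=False, relative_step=False)

model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)          # assumes batches are dicts that include labels
    loss = outputs.loss
    accelerator.backward(loss)        # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Loss: {loss.item():.4f}")  # monitor loss during training
```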

15. Inspect the optimizer state

Once training finishes, we inspect the optimizer state to understand how it works. We can access the optimizer through its state attribute. Alternatively, we can access the optimizer through Trainer when using TrainingArguments. Printing the state shows the row and column sums of the second moment, denoted by exp_avg_sq_row and exp_avg_sq_col. These outputs confirm our understanding of how Adafactor saves memory by storing the row and column sums.
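
As a sketch of that inspection, assuming the Adafactor optimizer from the examples above (exact key names can differ between Adafactor implementations):

```python
# When using Trainer, the fitted optimizer is available after training;
# with Accelerator, use the optimizer object we created directly
optimizer = trainer.optimizer

# optimizer.state maps each parameter to the statistics stored for it
for param, state in optimizer.state.items():
    if "exp_avg_sq_row" in state:  # factored second moment for 2D weight matrices
        print(param.shape, state["exp_avg_sq_row"].shape, state["exp_avg_sq_col"].shape)
```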

16. Compute memory usage of Adafactor

To quantify how much memory Adafactor saves, we call the compute_optimizer_size function we defined earlier. The function returns the number of parameters stored in the optimizer state and their memory usage. Compared to AdamW, Adafactor uses drastically less memory - 1 MB versus 502 MB! Adafactor is worth considering for large models that may have trouble fitting into memory.
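
compute_optimizer_size was defined earlier in the course; purely as a hypothetical reconstruction, a function like the one below would produce comparable numbers by walking the optimizer state and summing element counts and bytes.

```python
import torch

def compute_optimizer_size(optimizer):
    """Hypothetical sketch: count elements and bytes held in the optimizer state."""
    total_elements, total_bytes = 0, 0
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                total_elements += value.numel()
                total_bytes += value.numel() * value.element_size()
    return total_elements, total_bytes / (1024 ** 2)  # values stored, size in MB

num_params, size_mb = compute_optimizer_size(optimizer)
print(f"Optimizer state: {num_params} values, {size_mb:.1f} MB")
```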

17. Let's practice!

Now it's your turn!