
Balanced training with AdamW

1. Balanced training with AdamW

We've covered techniques for efficient training.

2. Efficient training

Now we'll examine optimizers as levers to improve distributed training efficiency,

3. Optimizers for training efficiency

focusing on AdamW,

4. Optimizers for training efficiency

Adafactor,

5. Optimizers for training efficiency

and 8-bit Adam.

6. Optimizer tradeoffs

These optimizers highlight tradeoffs, especially in distributed training, where large models increase computational demands.

7. Optimizer tradeoffs

To accelerate training, we consider reducing the number of parameters

8. Optimizer tradeoffs

or reducing precision.

9. Optimizer tradeoffs

AdamW is a common optimizer that balances these choices, helps models learn quickly, and serves as a benchmark for others.

10. How does AdamW work?

The training process with AdamW begins with the standard forward pass

11. How does AdamW work?

and gradient computation.

12. How does AdamW work?

To expedite learning, the optimizer computes an exponential moving average, or EMA, of the gradients, which is a weighted average of all past gradients, where weights decay exponentially further into the past.

13. How does AdamW work?

It also computes the EMA of squared gradients

14. How does AdamW work?

and applies weight decay, which shrinks each parameter slightly toward zero at every step.

15. How does AdamW work?

The training process combines these terms to update the model.
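To make the update concrete, here is a minimal sketch of a single AdamW-style step in plain PyTorch. It is an illustration only, not the course's code: the tensors are made up, and the hyperparameter names mirror torch.optim.AdamW's defaults.

```python
import torch

# Made-up example tensors: one parameter and its current gradient
param = torch.randn(4)
grad = torch.randn(4)

# Optimizer state starts at zero: EMA of gradients and EMA of squared gradients
exp_avg = torch.zeros_like(param)
exp_avg_sq = torch.zeros_like(param)

lr, beta1, beta2, eps, weight_decay = 1e-3, 0.9, 0.999, 1e-8, 0.01
step = 1

# Decoupled weight decay: shrink the parameter directly, independent of the gradient
param = param * (1 - lr * weight_decay)

# Update the two exponential moving averages
exp_avg = beta1 * exp_avg + (1 - beta1) * grad
exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2

# Bias-correct the EMAs (matters most in early steps), then apply the update
exp_avg_hat = exp_avg / (1 - beta1 ** step)
exp_avg_sq_hat = exp_avg_sq / (1 - beta2 ** step)
param = param - lr * exp_avg_hat / (exp_avg_sq_hat.sqrt() + eps)
```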

16. Memory usage of AdamW

AdamW keeps extra state based on the gradients of each parameter. Here, each square is a parameter, and each color is a state.

17. Memory usage of AdamW

AdamW tracks two states per parameter: the EMA of the gradients

18. Memory usage of AdamW

and the EMA of squared gradients, represented by two colors (green and red). In standard FP32 precision, each parameter requires 8 bytes of optimizer state: 4 bytes per state times 2 states. The total memory usage is therefore 8 bytes multiplied by the number of model parameters.

19. Estimate memory usage of AdamW

As an example, loading a Transformer model with approximately 65 million parameters, counted with numel(), gives an estimated optimizer memory usage of 8 bytes per parameter, or about 502 MB in total.
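A minimal sketch of this estimate; the small stand-in model below is only a placeholder for the roughly 65-million-parameter Transformer mentioned above.

```python
import torch.nn as nn

# Stand-in model; in practice this would be the loaded Transformer
model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))

# Count parameters with numel(), then apply 8 bytes of AdamW state per parameter
num_params = sum(p.numel() for p in model.parameters())
optimizer_bytes = 8 * num_params  # 2 FP32 states x 4 bytes each
print(f"{num_params / 1e6:.1f}M parameters -> ~{optimizer_bytes / 1024 ** 2:.0f} MB of optimizer state")
```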

20. Trainer and Accelerator

Now we'll implement AdamW with Trainer and Accelerator.

21. Implement AdamW with Trainer

To use Trainer, we initialize AdamW with the model parameters, assuming the model and dataset are already defined. We pass the optimizer and lr_scheduler as arguments to the Trainer. Finally, we call trainer.train() to start training and monitor progress.
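A sketch of what this setup could look like, assuming model and train_dataset are already defined; the learning rate, scheduler, and output directory are illustrative choices, not the course's exact values.

```python
import torch
from transformers import Trainer, TrainingArguments, get_linear_schedule_with_warmup

# Initialize AdamW with the model parameters (model and train_dataset assumed defined)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=1_000
)

training_args = TrainingArguments(output_dir="./results", num_train_epochs=1)

# Trainer takes the optimizer and scheduler together through the `optimizers` argument
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, lr_scheduler),
)

trainer.train()  # start training and monitor progress
```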

22. Implement AdamW with Accelerator

To use Accelerator, we first instantiate AdamW so it can be used within the training loop. The loop structure remains unchanged, and the loss can be monitored during training.
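A sketch of the Accelerator version, assuming model and train_dataloader are already defined and that each batch is a dict a Hugging Face model can consume; the names are placeholders.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Declare AdamW up front so it can be used inside the training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    print(f"Loss: {loss.item():.4f}")  # monitor the loss during training
```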

23. Inspecting the optimizer state

After training, we can inspect the optimizer state. It shows the current step as three. exp_avg is the EMA of the gradients, represented by a tensor, and exp_avg_sq is the EMA of squared gradients, also a tensor. Here we show example values, which happen to be zeros.
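One way to look at this state, assuming optimizer is the AdamW instance trained above; the index 0 simply picks the first parameter.

```python
# The optimizer state dict holds one entry per parameter
state = optimizer.state_dict()["state"]
first_param_state = state[0]

print(first_param_state["step"])        # e.g. 3 after three updates
print(first_param_state["exp_avg"])     # EMA of the gradients (a tensor)
print(first_param_state["exp_avg_sq"])  # EMA of the squared gradients (a tensor)
```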

24. Computing the optimizer size

To compute the size of an optimizer, we loop over the parameters in the optimizer state. For each set of parameters, we extract the state tensors, such as the EMA of the gradients and the EMA of squared gradients. For each tensor, we extract the number of elements with numel() and the size of each element in bytes with element_size(). We keep running totals of the number of elements and of the total size in megabytes, which is the number of elements times the size of each element in bytes. Finally, we return these totals at the end.
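A sketch of such a helper; the function name and exact structure are assumptions, not the course's code, and it expects the AdamW optimizer used during training.

```python
import torch

def compute_optimizer_size(optimizer):
    """Return the total number of state elements and their size in megabytes."""
    total_elements = 0
    total_size_mb = 0.0
    # Loop over the per-parameter state kept by the optimizer (exp_avg, exp_avg_sq, ...)
    for param_state in optimizer.state.values():
        for tensor in param_state.values():
            if torch.is_tensor(tensor):
                total_elements += tensor.numel()
                # numel() x element_size() gives bytes; convert to megabytes
                total_size_mb += tensor.numel() * tensor.element_size() / 1024 ** 2
    return total_elements, total_size_mb

num_elements, size_mb = compute_optimizer_size(optimizer)  # assumes `optimizer` from training
```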

25. Computing the optimizer size

After calling our function, we obtain about 131 million state elements, two per model parameter, and an optimizer size of about 502 megabytes. With AdamW being a common default optimizer, these values will serve as a baseline for comparing other optimizers.

26. Let's practice!

Over to you!