
Mixed precision training

1. Mixed precision training

Let's examine

2. Mixed precision training accelerates computation

another technique to speed up distributed training, mixed precision training. We'll start by reviewing what precision is.

3. Faster calculations with less precision

Computers represent numbers using a sign, a mantissa, and an exponent. For example, we represent 0.125 with a positive sign, a mantissa of 1.25, and an exponent of -1, since 1.25 x 10^-1 = 0.125.

4. Faster calculations with less precision

These numbers are stored as bits, and precision refers to the number of bits used to represent a number. The standard representation is 32-bit floating point, or FP32: 1 bit for the sign, 23 bits for the mantissa, and 8 bits for the exponent.

5. Faster calculations with less precision

Lower precisions are faster to compute; for example, 2 x 2 requires fewer calculations than 222 x 222. 16-bit floating point, or FP16, uses 1 bit for the sign, 10 bits for the mantissa, and 5 bits for the exponent.
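To make these formats concrete, here is a small sketch (not from the course) that inspects both dtypes with PyTorch's torch.finfo:

```python
# Inspect the FP32 and FP16 formats discussed above.
import torch

for dtype in (torch.float32, torch.float16):
    info = torch.finfo(dtype)
    print(
        f"{dtype}: {info.bits} bits total, "
        f"smallest positive normal = {info.tiny}, "
        f"largest value = {info.max}"
    )

# torch.float32: 32 bits total, smallest positive normal ~1.18e-38, largest ~3.40e+38
# torch.float16: 16 bits total, smallest positive normal ~6.10e-05, largest 65504.0
```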

6. What is mixed precision training?

Mixed precision training performs computations with FP16 and FP32 to speed up training.

7. What is mixed precision training?

First, the model performs a forward pass using weights stored in FP16, represented as green.

8. What is mixed precision training?

Next it computes loss in FP32 (represented as yellow) to avoid underflow, which occurs when a number vanishes to zero because it falls below the smallest value the format can represent. For example, FP16 cannot represent normal values below roughly 6e-5, and very small numbers round to zero.
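A minimal sketch of underflow, assuming only PyTorch is installed: a value that FP32 stores comfortably rounds to zero once cast to FP16.

```python
import torch

small = torch.tensor(1e-8)                 # ~1e-08, fine in FP32
print(small.item())
print(small.half().item())                 # 0.0 -> underflow in FP16
print(torch.finfo(torch.float16).tiny)     # ~6.1e-05, smallest normal FP16 value
```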

9. What is mixed precision training?

Then we scale the loss by multiplying it by a large factor, so that small gradient values don't underflow (vanish to zero) during the backward pass.
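As a rough illustration of the arithmetic (65536, or 2**16, is GradScaler's default initial scale), scaling keeps a tiny value representable in FP16, and dividing in FP32 recovers its original magnitude:

```python
import torch

grad = 1e-8                                   # gradient-sized value, too small for FP16
scale = 65536.0                               # GradScaler's default initial scale (2**16)

print(torch.tensor(grad).half().item())       # 0.0 -> underflow without scaling
scaled = torch.tensor(grad * scale).half()    # ~6.55e-04, representable in FP16
print((scaled.float() / scale).item())        # ~1e-08, recovered after unscaling in FP32
```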

10. What is mixed precision training?

The model computes gradients in FP16 during the backward pass.

11. What is mixed precision training?

We divide gradients by the same scale factor to undo scaling.

12. What is mixed precision training?

Finally, we update model parameters with gradients in FP32

13. What is mixed precision training?

and store model parameters in FP16.

14. PyTorch implementation

Now we'll use PyTorch to build a mixed precision training loop.

15. Mixed precision training with PyTorch

We begin by defining GradScaler(). GradScaler() implements gradient scaling to prevent numerical underflow. After we load batch data, we define a torch.autocast() block, so the forward pass inside the block uses FP16 precision. In autocast(), we specify device_type as cpu or cuda depending on which device is available; cuda refers to NVIDIA GPUs. scaler.scale() scales the loss to prevent numerical underflow, and then backward() computes the gradients. scaler.step() unscales the gradients and steps the optimizer, updating the model parameters. scaler.update() updates the scale factor. Finally, optimizer.zero_grad() zeros the gradients.
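Putting those pieces together, here is a minimal sketch of the loop just described; it assumes model, optimizer, loss_fn, and dataloader are already defined and that a CUDA GPU is available.

```python
import torch

scaler = torch.cuda.amp.GradScaler()            # implements gradient scaling

for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()

    # Forward pass and loss run inside the autocast block
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()               # scale the loss, compute gradients
    scaler.step(optimizer)                      # unscale gradients, update parameters
    scaler.update()                             # adjust the scale factor
    optimizer.zero_grad()                       # zero the gradients
```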

16. From PyTorch to Accelerator

The PyTorch implementation helps us understand under-the-hood details of Hugging Face classes like Accelerator and Trainer.

17. From PyTorch to Accelerator

Next we'll see how Accelerator simplifies the training loop.

18. Mixed precision training with Accelerator

We begin by defining Accelerator with mixed_precision set to fp16. The prepare method enables mixed precision for our training objects. The rest of the loop doesn't require modifications; gradient scaling and underflow protection are handled automatically.
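A corresponding sketch with Accelerate, again assuming model, optimizer, loss_fn, and dataloader are already defined; note that accelerator.backward() takes the place of the usual loss.backward() call.

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    accelerator.backward(loss)      # gradient scaling handled internally
    optimizer.step()
    optimizer.zero_grad()
```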

19. From Accelerator to Trainer

The Accelerator example showed how to simplify training loops.

20. From Accelerator to Trainer

Most Hugging Face Transformers models don't require custom training loops, and Trainer is a good fit for these cases.

21. Mixed precision training with Trainer

For Trainer, we set fp16 to True in TrainingArguments, pass it to Trainer, and begin training with mixed precision.
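A minimal sketch with Trainer, assuming model and train_dataset are already defined and using a placeholder output_dir:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",     # placeholder path
    fp16=True,                  # enable mixed precision training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```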

22. Let's practice!

Your turn to accelerate computation with mixed precision training!