1. Prepare models with AutoModel and Accelerator
Welcome to our course on Distributed AI Training! I'm Dennis Lee, your instructor.
2. Meet your instructor!
I'm a data engineer in the tech industry, building data infrastructure and optimizing supply chain networks.
3. Meet your instructor!
I've also been a data scientist in management consulting, developing machine learning techniques in distributed settings.
4. Meet your instructor!
During my Ph.D. in Electrical Engineering, I investigated optimization algorithms for physics and optics. I'm passionate about simplifying science and technology for everyone
5. Meet your instructor!
and excited to share my knowledge.
6. Our roadmap to efficient AI training
We'll learn how to train models across multiple devices, known as distributed AI model training, addressing challenges like hardware constraints,
7. Our roadmap to efficient AI training
lengthy training times,
8. Our roadmap to efficient AI training
and memory limitations. This approach reduces training times for large language models with trillions of parameters from hundreds of years to weeks. By the end, we'll know how to build scalable machine learning models for various AI-powered applications.
9. Our roadmap to efficient AI training
We'll cover data preparation, where we place data on multiple devices;
10. Our roadmap to efficient AI training
distributed training, which scales training to multiple devices;
11. Our roadmap to efficient AI training
efficient training, which optimizes available devices;
12. Our roadmap to efficient AI training
and optimizers, which can help speed up distributed training.
13. CPUs vs GPUs
Let's get started. Distributed training typically occurs on CPUs and GPUs. Most laptops have CPUs, but some, such as high-end gaming laptops, also have GPUs capable of training large models.
14. CPUs vs GPUs
CPUs are designed for general-purpose computing, like word processing, whereas GPUs specialize in highly parallel computing. CPUs handle complex control flow better, while GPUs excel at matrix operations. All of the libraries in this course, including the Accelerator class from Hugging Face Accelerate, run on both types of devices.
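As a quick illustration, a minimal sketch for checking which device is available (assuming PyTorch is installed) might look like this:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU;
# the training code in this course runs on either device.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")
```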
15. Distributed training
Now that we understand where distributed training occurs, let's explore how it works. The key steps are data sharding (where each device processes a subset of data in parallel),
16. Distributed training
model replication (where each device performs forward and backward passes on its own copy of the model using its data subset), gradient aggregation (where a designated device aggregates gradients from all devices), and parameter synchronization (where the designated device shares the updated model parameters across devices). We'll discuss these terms in more detail later.
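To make the four steps concrete, here is a toy, single-process sketch in plain PyTorch. The tensor shapes, the two simulated "devices", and the SGD settings are illustrative assumptions, not the course's code; in practice, libraries such as Accelerate perform these steps across real devices.

```python
import copy
import torch
import torch.nn as nn

batch = torch.randn(8, 4)    # full batch of inputs
targets = torch.randn(8, 1)  # matching targets
model = nn.Linear(4, 1)      # the "global" model

# 1. Data sharding: split the batch into two subsets, one per "device".
data_shards = torch.chunk(batch, 2)
target_shards = torch.chunk(targets, 2)

# 2. Model replication: each "device" runs forward and backward passes
#    on its own copy of the model with its own data shard.
replicas = [copy.deepcopy(model) for _ in data_shards]
for replica, x, y in zip(replicas, data_shards, target_shards):
    nn.functional.mse_loss(replica(x), y).backward()

# 3. Gradient aggregation: average each parameter's gradients across replicas.
for name, param in model.named_parameters():
    grads = [dict(r.named_parameters())[name].grad for r in replicas]
    param.grad = torch.stack(grads).mean(dim=0)

# 4. Parameter synchronization: update the global model and copy the new
#    parameters back to every replica.
torch.optim.SGD(model.parameters(), lr=0.1).step()
for replica in replicas:
    replica.load_state_dict(model.state_dict())
```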
17. Effortless efficiency: leveraging pre-trained models
To begin training, we'll leverage Hugging Face Transformer models for audio, images, and text. We initialize model parameters by calling AutoModelForSequenceClassification.from_pretrained() and display the configuration to see information about the architecture, such as the model name and parameters. Note that there are different AutoModel classes for different tasks.
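As an example, loading such a model and printing its configuration might look like the following sketch; the checkpoint name and number of labels are placeholder assumptions, not necessarily the ones used in the course.

```python
from transformers import AutoModelForSequenceClassification

# Load a pretrained checkpoint; "bert-base-uncased" and num_labels=2 are
# illustrative choices for a binary classification task.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Display the configuration to inspect the architecture: model type,
# hidden size, number of layers, and so on.
print(model.config)
```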
18. Device placement with Accelerator
Next, we'll prepare our model for distributed training using the Accelerator class from the Hugging Face ecosystem, which integrates seamlessly with Transformers and other Hugging Face libraries. Accelerator detects the devices available on our computer and automates device placement and data parallelism through accelerator.prepare(). This method places the model on the first available GPU or defaults to the CPU if no GPU is found. It works with PyTorch models of type torch.nn.Module. Finally, we can display the device Accelerator has selected.
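A minimal sketch of this workflow, reusing the example model above (the exact objects passed to prepare() in the course may differ):

```python
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

# Accelerator detects the available hardware (GPU or CPU) on creation.
accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # illustrative checkpoint and labels
)

# prepare() moves the torch.nn.Module to the detected device and wraps it
# for data-parallel training when several devices are present.
model = accelerator.prepare(model)

# Show which device Accelerator selected, e.g. cuda:0 or cpu.
print(accelerator.device)
```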
19. Let's practice!
Now practice preparing models!