Transformers with PyTorch
1. Breaking down the Transformer
Hi, welcome to the course! I'm James, and together, we'll learn how to build transformer models from the ground up with PyTorch.
2. The paper that changed everything...
In 2017, the transformer deep learning architecture burst onto the scene with the release of the paper Attention Is All You Need. The architecture, shown here, brought together advancements in the deep learning field, like attention mechanisms, into a new design optimized for modeling sequences like text. Now, transformers are ubiquitous in deep learning, kickstarting the generative AI boom by forming the base for modern large language models, or LLMs. In this course, we'll build a transformer model, component by component, to understand exactly how these revolutionary models work. The transformer architecture is made of two blocks: the encoder block and the decoder block.
5. Unpacking the Transformer...
The encoder block consists of multiple identical layers that are responsible for reading and processing the entire input sequence, generating context-rich numerical representations. It does this using self-attention and feed-forward networks, which we'll discuss in a moment.
The decoder block essentially does the inverse of the encoder block, generating an output sequence based on the encoded input sequence.
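PyTorch ships both blocks as ready-made modules; here is a minimal sketch of wiring them together (the sizes and random tensors below are illustrative assumptions, not values from the course):

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 6

# Encoder block: a stack of identical encoder layers
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Decoder block: attends to its own outputs and to the encoder's memory
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

src = torch.randn(2, 10, d_model)   # embedded input sequence
tgt = torch.randn(2, 7, d_model)    # embedded output sequence so far
memory = encoder(src)               # context-rich representations
out = decoder(tgt, memory)          # shape: (2, 7, d_model)
```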
Key to both encoder and decoder blocks is positional encoding, which allows tokens to be processed in parallel by encoding each token's position in the sequence. This enables the model to recognize the relationships between tokens and their order, essential for making sense of sentences and capturing their context.
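The original paper uses fixed sinusoidal encodings for this. Here is a minimal sketch; the class name PositionalEncoding and the max_len value are our own illustrative choices:

```python
import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
        self.register_buffer("pe", pe)                  # fixed, not learned

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); add each position's encoding
        return x + self.pe[: x.size(1)]
```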
Attention mechanisms are used to highlight the most important tokens and their relationships, which improves the quality of generated text.
Self-attention is a type of attention mechanism that assigns a weight to each token in the sequence simultaneously, capturing long-range dependencies.
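Self-attention is commonly implemented as scaled dot-product attention; a minimal sketch, with the function name and tensor shapes as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch_size, seq_len, d_model)
    d_k = query.size(-1)
    # Pairwise token similarities, scaled to keep softmax gradients stable
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # one weight per token pair
    return torch.matmul(weights, value)   # weighted sum of value vectors
```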
Multi-head attention extends self-attention by using multiple "heads" to focus on different aspects of the input sequence in parallel. This allows each head to capture distinct relational patterns within the data, leading to richer representations that enhance LLMs' effectiveness across tasks.
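PyTorch exposes this directly as nn.MultiheadAttention; a quick sketch of using it for self-attention, with example sizes:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)   # (batch_size, seq_len, embed_dim)
# Self-attention: the query, key, and value are all the same sequence
output, weights = attention(x, x, x)
print(output.shape)                 # torch.Size([2, 10, 512])
```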
Finally, we have position-wise feed-forward networks. These are simple neural networks that apply complex transformations to each token's embedding independently. Because each token gets its own transformation, the networks are position-independent, hence the name "position-wise". We'll explore these components as we build our own transformers.
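Such a network is typically just two linear layers with a nonlinearity in between, applied identically at every position; a minimal sketch using the paper's default sizes (the class name is our own):

```python
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """Two linear layers applied identically at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); nn.Linear acts on the last
        # dimension, so each position is transformed independently
        return self.ff(x)
```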
12. Transformers in PyTorch
Like other models you may have come across, PyTorch provides a high-level class in torch.nn to quickly define an architecture. nn.Transformer() takes four key parameters: d_model, the dimensionality of the embedded sequence; nhead, the number of attention heads; and num_encoder_layers and num_decoder_layers, the number of layers to include in the respective encoder and decoder blocks. Let's view the model object.
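For example, a sketch matching the six-layer model described next (d_model=512 and nhead=8 are nn.Transformer's defaults, written out here for clarity):

```python
import torch.nn as nn

# Instantiate a transformer with six encoder and six decoder layers
model = nn.Transformer(
    d_model=512,           # dimensionality of the embedded sequence
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # layers in the encoder block
    num_decoder_layers=6,  # layers in the decoder block
)
print(model)
```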
The output is pretty long. Inside the Transformer, we can actually see the transformer encoder block containing six transformer encoder layers, along with multi-head attention and other bits you may have encountered elsewhere in deep learning, like linear layers, dropout layers, and layer normalization. We'll include many of these pieces in our own transformer models.
Further down, we can also see the decoder block with six transformer decoder layers.
15. Let's practice!
With these components in mind, let's build transformers with PyTorch!