Adding methods to the MultiHeadAttention class
In this exercise, you'll build out the rest of the MultiHeadAttention
class by defining four methods:
- .split_heads(): split and transform the input embeddings across the attention heads.
- .compute_attention(): calculate the scaled dot-product attention weights and multiply them by the values matrix (see the sketch after this list).
- .combine_heads(): transform the attention outputs back into the same shape as the input embeddings, x.
- .forward(): call the other methods to pass the input embeddings through each of these steps.
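As a reference for the middle step, here is a minimal sketch of the scaled dot-product attention that .compute_attention() performs on tensors that have already been split into heads. The standalone function and argument names are illustrative only and are not part of the exercise's starter code; the imports are repeated to keep the sketch self-contained.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch_size, num_heads, seq_length, head_dim)
    head_dim = query.size(-1)
    # Scale the dot products by the square root of the head dimension
    scores = torch.matmul(query, key.transpose(-2, -1)) / (head_dim ** 0.5)
    # Softmax over the key positions gives the attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Multiply the attention weights by the values matrix
    return torch.matmul(attention_weights, value)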
torch.nn has been imported as nn, torch.nn.functional is available as F, and torch is also available.
This exercise is part of the course Transformer Models with PyTorch.
Have a go at this exercise by completing this sample code.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model, bias=False)
        self.key_linear = nn.Linear(d_model, d_model, bias=False)
        self.value_linear = nn.Linear(d_model, d_model, bias=False)
        self.output_linear = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        seq_length = x.size(1)
        # Split the input embeddings and permute
        x = x.____
        return x.permute(0, 2, 1, 3)
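For reference, one possible completion of the blank and of the remaining three methods is sketched below. This is a sketch under the assumptions stated here, not the course's official solution; in particular, the optional mask argument and the example sizes in the usage lines are assumptions added for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model, bias=False)
        self.key_linear = nn.Linear(d_model, d_model, bias=False)
        self.value_linear = nn.Linear(d_model, d_model, bias=False)
        self.output_linear = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        seq_length = x.size(1)
        # Reshape (batch_size, seq_length, d_model) into separate heads
        x = x.view(batch_size, seq_length, self.num_heads, self.head_dim)
        # Move the heads dimension ahead of the sequence dimension
        return x.permute(0, 2, 1, 3)

    def compute_attention(self, query, key, value, mask=None):
        # Scaled dot-product attention weights multiplied by the values matrix
        scores = torch.matmul(query, key.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            # Assumption: mask uses 0 to hide positions from the softmax
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attention_weights = F.softmax(scores, dim=-1)
        return torch.matmul(attention_weights, value)

    def combine_heads(self, x, batch_size):
        # Undo split_heads: back to (batch_size, seq_length, d_model)
        x = x.permute(0, 2, 1, 3).contiguous()
        return x.view(batch_size, -1, self.d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project the inputs and split them across the heads
        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)
        # Attend, recombine the heads, and apply the output projection
        attention_output = self.compute_attention(query, key, value, mask)
        return self.output_linear(self.combine_heads(attention_output, batch_size))

# Example usage with illustrative sizes
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch_size, seq_length, d_model)
output = mha(x, x, x)         # self-attention; output shape: (2, 10, 512)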