The decoder layer
Like encoder transformers, decoder transformers are built from multiple layers that combine multi-head attention and feed-forward sublayers. Have a go at combining these components to build a DecoderLayer class.
The MultiHeadAttention and FeedForwardSubLayer classes are available for you to use, along with the tgt_mask you created.
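If you need a reminder of how that mask is built, a causal (look-ahead) mask is usually a lower-triangular matrix. Here is a minimal sketch, assuming the convention that a 1 marks a position that may be attended to:

import torch

seq_length = 5  # example length of the target sequence
# Row i allows attention to positions 0..i and blocks later positions
tgt_mask = torch.tril(torch.ones(seq_length, seq_length))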
Exercise instructions
Complete the forward() method to pass the input embeddings through the layers defined in the __init__ method:
- Perform the attention calculation using the tgt_mask provided and the input embeddings, x, for the query, key, and value matrices.
- Apply dropout and the first layer normalization, norm1.
- Pass the result through the feed-forward sublayer, ff_sublayer.
- Apply dropout and the second layer normalization, norm2.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, tgt_mask):
        # Perform the attention calculation
        attn_output = self.____
        # Apply dropout and the first layer normalization
        x = self.____(x + self.____(attn_output))
        # Pass through the feed-forward sublayer
        ff_output = self.____(x)
        # Apply dropout and the second layer normalization
        x = self.____(x + self.____(ff_output))
        return x
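For reference, below is one way the blanks could be filled in, following the steps above. This is a sketch, not the course's official solution: it assumes MultiHeadAttention is called with the query, key, value, and mask in that order, and that FeedForwardSubLayer takes the layer input directly. The MultiHeadAttention and FeedForwardSubLayer shown here are minimal stand-ins included only so the snippet runs on its own; your versions built earlier in the course may differ internally, so check their signatures before copying.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for the course-provided multi-head attention class
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project and split into heads: (batch, heads, seq, head_dim)
        def split(x, linear):
            return linear(x).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(query, self.query_linear), split(key, self.key_linear), split(value, self.value_linear)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if mask is not None:
            # Positions where the mask is 0 are blocked from attention
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.head_dim)
        return self.output_linear(out)

# Hypothetical stand-in for the course-provided feed-forward sublayer
class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, tgt_mask):
        # Masked self-attention: x supplies the query, key, and value
        attn_output = self.self_attn(x, x, x, tgt_mask)
        # Residual connection, dropout, then the first layer normalization
        x = self.norm1(x + self.dropout(attn_output))
        # Position-wise feed-forward sublayer
        ff_output = self.ff_sublayer(x)
        # Residual connection, dropout, then the second layer normalization
        x = self.norm2(x + self.dropout(ff_output))
        return x

# Quick check on dummy data
batch_size, seq_length, d_model = 2, 5, 16
x = torch.randn(batch_size, seq_length, d_model)
tgt_mask = torch.tril(torch.ones(seq_length, seq_length))
layer = DecoderLayer(d_model, num_heads=4, d_ff=32, dropout=0.1)
print(layer(x, tgt_mask).shape)  # torch.Size([2, 5, 16])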