The decoder layer
Like encoder transformers, decoder transformers are built from multiple layers that combine multi-head attention and feed-forward sublayers. Have a go at combining these components to build a DecoderLayer class. The MultiHeadAttention and FeedForwardSubLayer classes are available for you to use, along with the tgt_mask you created.
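If you still need that mask, here is a minimal sketch of a causal target mask built with torch.tril; the sequence length of 10 and the boolean format are assumptions for illustration, so match them to whatever your earlier exercise produced.

import torch

seq_len = 10  # assumed example length; use your own target sequence length
# Lower-triangular matrix: position i may attend only to positions 0..i
tgt_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()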
Exercise instructions
Complete the forward() method to pass the input embeddings through the layers defined in the __init__ method:
- Perform the attention calculation using the tgt_mask provided and the input embeddings, x, for the query, key, and value matrices.
- Apply dropout and the first layer normalization, norm1.
- Perform the pass through the feed-forward sublayer, ff_sublayer.
- Apply dropout and the second layer normalization, norm2.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, tgt_mask):
        # Perform the attention calculation
        attn_output = self.____
        # Apply dropout and the first layer normalization
        x = self.____(x + self.____(attn_output))
        # Pass through the feed-forward sublayer
        ff_output = self.____(x)
        # Apply dropout and the second layer normalization
        x = self.____(x + self.____(ff_output))
        return x
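If you want to check your answer against something concrete, here is one possible completion of forward(), assuming the MultiHeadAttention class from earlier in the course is called as self_attn(query, key, value, mask); verify that argument order against your own class before relying on it.

    def forward(self, x, tgt_mask):
        # Masked self-attention: x serves as query, key, and value
        attn_output = self.self_attn(x, x, x, tgt_mask)
        # Residual connection with dropout, then the first layer normalization
        x = self.norm1(x + self.dropout(attn_output))
        # Position-wise feed-forward sublayer
        ff_output = self.ff_sublayer(x)
        # Residual connection with dropout, then the second layer normalization
        x = self.norm2(x + self.dropout(ff_output))
        return x

A quick shape check, reusing the tgt_mask sketched earlier (the dimensions below are placeholders, not values from the exercise): the output should keep the same shape as the input embeddings.

layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.randn(2, seq_len, 512)   # (batch_size, seq_len, d_model)
out = layer(x, tgt_mask)           # output shape stays (2, seq_len, 512)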