The decoder layer
Like encoder transformers, decoder transformers are built from multiple layers that combine multi-head attention and feed-forward sublayers. Have a go at combining these components to build a DecoderLayer class. The MultiHeadAttention and FeedForwardSubLayer classes are available for you to use, along with the tgt_mask you created.
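If you still need that mask, here is a minimal sketch of a causal target mask built with torch.tril; the sequence length of 10 and the boolean format are assumptions for illustration, so match them to whatever your earlier exercise produced.

import torch

seq_len = 10  # assumed example length; use your own target sequence length
# Lower-triangular matrix: position i may attend only to positions 0..i
tgt_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()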
Exercise instructions
Complete the forward() method to pass the input embeddings through the layers defined in the __init__ method:
- Perform the attention calculation using the tgt_mask provided and the input embeddings, x, for the query, key, and value matrices.
- Apply dropout and the first layer normalization, norm1.
- Perform the pass through the feed-forward sublayer, ff_sublayer.
- Apply dropout and the second layer normalization, norm2.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, tgt_mask):
        # Perform the attention calculation
        attn_output = self.____
        # Apply dropout and the first layer normalization
        x = self.____(x + self.____(attn_output))
        # Pass through the feed-forward sublayer
        ff_output = self.____(x)
        # Apply dropout and the second layer normalization
        x = self.____(x + self.____(ff_output))
        return x
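If you want to check your answer against something concrete, here is one possible completion of forward(), assuming the MultiHeadAttention class from earlier in the course is called as self_attn(query, key, value, mask); verify that argument order against your own class before relying on it.

    def forward(self, x, tgt_mask):
        # Masked self-attention: x serves as query, key, and value
        attn_output = self.self_attn(x, x, x, tgt_mask)
        # Residual connection with dropout, then the first layer normalization
        x = self.norm1(x + self.dropout(attn_output))
        # Position-wise feed-forward sublayer
        ff_output = self.ff_sublayer(x)
        # Residual connection with dropout, then the second layer normalization
        x = self.norm2(x + self.dropout(ff_output))
        return x

A quick shape check, reusing the tgt_mask sketched earlier (the dimensions below are placeholders, not values from the exercise): the output should keep the same shape as the input embeddings.

layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.randn(2, seq_len, 512)   # (batch_size, seq_len, d_model)
out = layer(x, tgt_mask)           # output shape stays (2, seq_len, 512)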