Decoder transformers
1. Decoder transformers
As well as encoder transformers, we can also build decoder transformers.

2. From original to decoder-only transformer
The decoder-only architecture is again a simplified version of the original transformer, specialized in sequence generation tasks that do not require an encoder. Concretely, it is designed for autoregressive sequence generation tasks like text generation and completion, where each token in the sequence is predicted based only on the preceding tokens. The architecture is similar to the encoder-only approach, with two differences.

One is the use of masked multi-head self-attention, which masks the future tokens in the sequence so that the model learns to predict them using only the prior tokens. Otherwise, the model would be able to "look ahead" and cheat rather than learning to predict. For each token in the target sequence, only the previously generated tokens are observed, whereas subsequent tokens are hidden using a causal attention mask.

The other difference lies in the transformer head, which generally consists of a linear layer with softmax activation over the entire vocabulary to estimate the likelihood of each token being the next one in the sequence, followed by sampling a token from the most likely ones.
6. Masked self-attention/causal attention
Let's explore and implement these two elements, starting with masked self-attention, also called causal attention. This is key to giving our model an autoregressive or causal behavior, and it is achieved by using an upper triangular, or causal, attention mask as shown.

By passing this matrix to the attention heads, each token only pays attention to the tokens that came before it in the sequence.

For instance, during training, the token "favorite" in the sequence "orange is my favorite fruit" would only pay attention to itself and the preceding tokens: orange, is, my, and favorite. This way, during inference, the model will predict the likelihood of the next token and should assign a relatively high probability to "fruit". To create the mask, we build a matrix of ones of shape sequence length by sequence length using torch.ones. Then we use torch.triu to keep only the elements strictly above the diagonal, subtract the result from one to get ones on and below the diagonal and zeros above it, and convert it to booleans, so zero values become False and non-zero values become True.
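As a rough, self-contained sketch (not the course's exact code), the mask could be built and applied to a matrix of attention scores as follows; the diagonal=1 offset keeps each token able to attend to itself, and masked_fill hides the future positions before the softmax:

```python
import torch
import torch.nn.functional as F

seq_length = 5  # e.g. "orange is my favorite fruit"

# Keep only the strictly upper triangle: ones mark the future positions
future = torch.triu(torch.ones(seq_length, seq_length), diagonal=1)

# Subtract from one and cast to booleans: True = a position the token may attend to
causal_mask = (1 - future).bool()

# Illustrative attention scores for one head (queries x keys)
scores = torch.randn(seq_length, seq_length)

# Hide the future positions by setting their scores to -inf before the softmax
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
attention_weights = F.softmax(masked_scores, dim=-1)

print(causal_mask)        # lower-triangular booleans (True = visible)
print(attention_weights)  # each row sums to 1 over the visible positions only
```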
9. Decoder layer
The decoder layer is the same as the encoder layer we created before: multi-head attention followed by a feed-forward sublayer, with layer normalizations and dropouts before and after. The only difference is that the padding mask has been replaced with the causal attention mask in the forward pass.
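A minimal sketch of such a decoder layer is shown below. The course builds its own MultiHeadAttention and feed-forward modules; to keep this example self-contained, it swaps in PyTorch's built-in nn.MultiheadAttention and an inline feed-forward network, so the class and argument names here are illustrative assumptions:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: masked multi-head self-attention plus a feed-forward
    sublayer, each wrapped with dropout, a residual connection, and layer norm."""
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask):
        # nn.MultiheadAttention masks positions where attn_mask is True,
        # so we invert the "True = visible" mask built on the previous slide.
        attn_output, _ = self.self_attn(x, x, x, attn_mask=~causal_mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.ff(x)
        return self.norm2(x + self.dropout(ff_output))
```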
10. Decoder transformer body and head
Regarding the transformer head, it can be implemented outside or inside the transformer body class. Let's illustrate how to add the head inside the decoder transformer class itself. Most of the code is identical to the encoder-only transformer class seen previously: we embed the inputs, create the positional encodings, and pass the result through the decoder layers. We then add a final linear layer with output size equal to the vocabulary size, followed by a softmax activation, to the forward method to project the hidden states into next-token likelihoods.
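Continuing the sketch, a decoder-only transformer with the head inside the class could look roughly like this. It reuses the DecoderLayer above, and the learned positional embeddings stand in for the positional encoding module built earlier in the course; the names and arguments are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyTransformer(nn.Module):
    """Decoder-only body with the language-modeling head included."""
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff,
                 max_seq_length, dropout):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Learned positional embeddings used as a stand-in for positional encodings
        self.position_embedding = nn.Embedding(max_seq_length, d_model)
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        # Transformer head: project hidden states onto the vocabulary
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, causal_mask):
        seq_length = input_ids.size(1)
        positions = torch.arange(seq_length, device=input_ids.device)
        x = self.token_embedding(input_ids) + self.position_embedding(positions)
        for layer in self.layers:
            x = layer(x, causal_mask)
        # Softmax turns each position's scores into next-token likelihoods
        return F.softmax(self.head(x), dim=-1)
```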
11. Instantiating the decoder-only transformer
After we instantiate the decoder, we can pass it the causal mask to be used in the attention mechanism.
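Under the same assumptions, instantiating the model and running a forward pass with the causal mask might look like this (all hyperparameter values are made up for illustration, and the code continues from the sketches above):

```python
# Hypothetical hyperparameters for illustration
vocab_size = 10000
d_model = 512
num_layers = 6
num_heads = 8
d_ff = 2048
max_seq_length = 128
dropout = 0.1
seq_length = 5

model = DecoderOnlyTransformer(vocab_size, d_model, num_layers, num_heads,
                               d_ff, max_seq_length, dropout)

# Causal mask built as on the earlier slide: True = position may be attended
causal_mask = (1 - torch.triu(torch.ones(seq_length, seq_length), diagonal=1)).bool()

# A batch containing one dummy token sequence
input_ids = torch.randint(0, vocab_size, (1, seq_length))

# Output: next-token probabilities for every position in the sequence
probabilities = model(input_ids, causal_mask)
print(probabilities.shape)  # torch.Size([1, 5, 10000])
```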
12. Let's practice!
Time to create a decoder-only transformer!