Encoder transformers
1. Encoder transformers
Let's build our first transformer using the components we've defined.
2. The original transformer
The transformer architecture combines an encoder and decoder for sequence-to-sequence language representation and generation.
3. The original transformer
The output from the encoder is fed into the decoder layers, forming the bridge between the two blocks.
4. Encoder-only transformers
Encoder-only transformers simplify this architecture, placing greater emphasis on understanding and representing the input for tasks such as text classification. They have two main components: the transformer body and the head. The transformer body, or encoder in this case, is a stack of multiple encoder layers designed to learn complex patterns from the inputs. Each encoder layer incorporates a multi-head self-attention mechanism to capture relationships between tokens in the sequence, followed by a feed-forward sublayer that maps this knowledge into abstract, nonlinear representations. Both elements are usually combined with techniques like layer normalization and dropout to improve training.
5. Encoder-only transformers
The head is the final layer, designed to produce task-specific outputs. In encoder transformers, these are typically supervised learning outputs such as classification labels or regression predictions.
6. Feed-forward sublayer in encoder layers
Let's look at the feed-forward sublayer. Our FeedForwardSublayer class contains two fully connected linear layers separated by a ReLU activation. Notice that we use an inner dimension d_ff between the linear layers, typically different from the embedding dimension used throughout the rest of the model, to further facilitate capturing complex patterns. The forward method applies the forward pass to the attention mechanism outputs, passing them through both layers.
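A minimal sketch of this sublayer in PyTorch (attribute names like fc1 and fc2 are illustrative, not necessarily the ones used in the exercises):

```python
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    """Two linear layers separated by a ReLU, applied to the attention outputs."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the inner dimension d_ff
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to the embedding dimension
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: attention outputs of shape (batch_size, seq_length, d_model)
        return self.fc2(self.relu(self.fc1(x)))
```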
7. Encoder layer
We've identified two key elements for implementing an encoder layer: multi-head self-attention and the feed-forward sublayer. These are incorporated into the encoder layer together with layer normalization, which keeps the scales and variances of the embeddings consistent before and after the feed-forward sublayer, and dropout, which regularizes and stabilizes training. Notice that when calling the attention mechanism in the forward pass, the input embeddings are passed as the query, key, and value matrices, and a mask is used to prevent the processing of padding tokens in the input sequence. Let's explore masks further.
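A sketch of such an encoder layer, reusing the FeedForwardSublayer above and PyTorch's built-in nn.MultiheadAttention in place of a custom attention class (the course may define its own MultiHeadAttention with a slightly different interface; residual connections are included here as is standard, even though the narration only calls out layer normalization and dropout):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention and feed-forward sublayers, each followed by dropout,
    a residual connection, and layer normalization."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff_sublayer = FeedForwardSublayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # The same embeddings serve as query, key, and value; key_padding_mask
        # (True at padded positions) excludes padding tokens from attention.
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.ff_sublayer(x)
        return self.norm2(x + self.dropout(ff_output))
```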
8. Masking the attention process
In NLP tasks with variable-length input sequences, padding ensures equal length across sequences by appending padding tokens, or zeros in this case. These padded tokens are irrelevant to the language task, so we exclude them from the attention mechanism. We do this by applying a padding mask to the attention scores, so that the attention weights linked to padded tokens end up at zero.
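One common way to build such a mask, assuming the padding token id is 0 and the True-means-ignore convention expected by nn.MultiheadAttention's key_padding_mask:

```python
import torch

# Two sequences padded with zeros to a common length of 5
token_ids = torch.tensor([[12, 45,  7,  0,  0],
                          [ 3, 28, 91, 14,  6]])

# True marks padded positions, which the attention mechanism should ignore
padding_mask = token_ids == 0
print(padding_mask)
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```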
9. Encoder transformer body
Once we fully implement an encoder layer, we stack several layers together to build our transformer body: the encoder. We define the token embedding based on the vocabulary size, followed by positional encoding using the previously defined class, and a stack of multiple encoder layers, using PyTorch's ModuleList class and a list comprehension. The forward pass embeds the inputs, performs positional encoding, and iterates through the encoder layers, applying the padding mask.
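A sketch of the encoder body along these lines; PositionalEncoding stands in for the class defined earlier in the course, and its exact signature is assumed here:

```python
import torch.nn as nn

class TransformerEncoder(nn.Module):
    """Encoder body: token embedding, positional encoding, and a stack of encoder layers."""
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)  # class defined previously (signature assumed)
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )

    def forward(self, token_ids, padding_mask=None):
        x = self.embedding(token_ids)    # (batch_size, seq_length, d_model)
        x = self.positional_encoding(x)  # add positional information
        for layer in self.layers:
            x = layer(x, padding_mask)   # apply the padding mask in every layer
        return x                         # final hidden states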
10. Encoder transformer head
For the transformer head, we can create a classification head, suitable for tasks like text classification and sentiment analysis. It consists of a linear layer with softmax activation to map the resulting encoder hidden states into class probabilities. A regression head has a linear layer with an output dimension equal to the number of target regression outputs. It's suited for tasks like estimating text readability or language complexity.
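Hedged sketches of both heads; pooling the first token's hidden state into a sequence-level summary is one common convention, assumed here rather than prescribed by the video:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear layer with softmax, mapping encoder hidden states to class probabilities."""
    def __init__(self, d_model, num_classes):
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, hidden_states):
        # Use the first token's hidden state as the sequence representation (one common choice)
        return torch.softmax(self.fc(hidden_states[:, 0, :]), dim=-1)

class RegressionHead(nn.Module):
    """Linear layer whose output dimension equals the number of regression targets."""
    def __init__(self, d_model, output_dim):
        super().__init__()
        self.fc = nn.Linear(d_model, output_dim)

    def forward(self, hidden_states):
        return self.fc(hidden_states[:, 0, :])
```

In use, the head is simply applied to the encoder's output, for example `probs = head(encoder(token_ids, padding_mask))`.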
11. Let's practice!
Let's build an encoder transformer!