Embedding and positional encoding
1. Embedding and positional encoding
Let's begin constructing our own transformer, starting with classes for embedding sequences and generating positional encodings.
2. Embedding and positional encoding in transformers
The transformer architecture starts with embedding sequences as vectors, and then encoding each token's position in the sequence so that tokens can be processed in parallel.
3. Embedding sequences
Suppose our input sequence has three tokens.
4. Embedding sequences
Each token has a unique ID in the model vocabulary, which is the set of tokens that the model can recognize.
5. Embedding sequences
When we embed them using an embedding layer, we get an embedding vector for each token. The length of this vector is also referred to as the number of dimensions, or dimensionality. Let's create a class to embed tokens for our transformer model.
6. Creating an embedding class
The InputEmbeddings class inherits from nn.Module. In the __init__ method, we define d_model, the dimensionality of the input embeddings, the model vocabulary size, and an embedding layer to map each token in the vocabulary to a vector. The forward method calculates and returns the embeddings. Scaling the embeddings by the square root of the dimensionality is a standard practice that ensures the token embeddings neither overwhelm nor are overwhelmed by the positional embeddings.
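Here is a minimal sketch of how such a class could look in PyTorch; the class name, d_model, and vocab_size follow the names used above, while the embedding attribute name is an assumption for illustration:

```python
import math

import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # Map each token ID in the vocabulary to a d_model-dimensional vector
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # Scale by sqrt(d_model) so the token embeddings sit on a similar
        # scale to the positional embeddings added later
        return self.embedding(x) * math.sqrt(self.d_model)
```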
7. Creating embeddings
Let's define an embedding layer with dimensionality 512 and vocab_size 10000. Passing an example batch of two sequences, each with four token IDs, into the embedding layer and printing the shape shows embeddings of the correct dimensionality for each token ID in each sequence of the batch.
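Continuing with the InputEmbeddings sketch above, this step might look as follows; the specific token ID values are made up for illustration:

```python
embedding_layer = InputEmbeddings(d_model=512, vocab_size=10000)

# Example batch: two sequences of four token IDs each
token_ids = torch.tensor([[1, 2, 3, 4],
                          [5, 6, 7, 8]])

embedded_output = embedding_layer(token_ids)
print(embedded_output.shape)  # torch.Size([2, 4, 512])
```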
8. Positional encoding
Let's now discuss positional encoding. This encodes each token's position in the sequence as a positional embedding and adds it to the token embedding to capture the positional information. The token and positional embeddings usually have the same dimensionality so they can be added easily.
9. Positional encoding
These positional embeddings are generated using an equation, which we'll see in a moment, that uses the token's position and the sine and cosine functions to generate unique positional embeddings. Sine is used for even-indexed embedding values, and cosine for odd-indexed values.
10. Sin and cosine
sin and cos are periodic functions that output a value between -1 and 1 for any value of x. These functions are central to the positional encoding calculation. For a token at a given position, its positional embedding vector can be calculated from these functions. The first value of the positional embedding vector, i=0, is calculated using the sine equation, as the index is even. The second value, i=1, uses the cosine equation, and so on. d_model is the dimensionality, which is the same as the token embedding dimensionality. Let's code this out!
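For reference, the equations being described are the sinusoidal positional encodings from the original transformer paper, where pos is the token's position and i indexes pairs of embedding dimensions, so even-indexed dimensions use sine and odd-indexed dimensions use cosine:

```latex
PE_{(pos,\, 2i)}   = \sin\!\left( \frac{pos}{10000^{2i / d_{\text{model}}}} \right)
PE_{(pos,\, 2i+1)} = \cos\!\left( \frac{pos}{10000^{2i / d_{\text{model}}}} \right)
```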
11. Building a positional encoder
We define a PositionalEncoding class and initialize the positional embeddings, pe, to zeros. Next, we create a tensor of positions for each token in the sequence, which we transform using unsqueeze so it can be used in the positional encoding calculations. Here are the sine and cosine calculations, matching the equations we just saw. The important thing to remember is that sine is applied to even-indexed vector values and cosine to odd-indexed values. register_buffer stores pe without making it a learnable parameter during training. Finally, we add the positional embeddings to the input token embeddings, x.
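A minimal sketch of such a class is shown below, reusing the imports from the earlier sketch; the max_seq_length argument and the log-space computation of the 10000^(2i/d_model) denominator are common implementation choices assumed here for illustration:

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_seq_length: int = 512):
        super().__init__()
        # Positional embeddings, initialized to zeros: one row per position
        pe = torch.zeros(max_seq_length, d_model)

        # Position of each token in the sequence, reshaped with unsqueeze
        # into a column vector of shape (max_seq_length, 1)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)

        # 1 / 10000^(2i / d_model), computed in log space for numerical stability
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )

        # Sine on even-indexed dimensions, cosine on odd-indexed dimensions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Store pe with the module without making it a learnable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # Add the positional embeddings to the input token embeddings, x
        return x + self.pe[:, : x.size(1)]
```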
12. Creating positional encodings
Let's create a positional encoder for our token embeddings. Applying the layer to the embedded_output produces a tensor of the same shape as the token embeddings.
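Assuming the PositionalEncoding sketch above and the embedded_output from earlier, this step might look like the following; the variable names are illustrative:

```python
pos_encoding_layer = PositionalEncoding(d_model=512, max_seq_length=4)

output = pos_encoding_layer(embedded_output)
print(output.shape)  # torch.Size([2, 4, 512]), the same shape as the token embeddings
```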
13. Let's practice!
Now it's your turn!