Exercise

Starting the MultiHeadAttention class

Now that you've defined classes for creating token embeddings and positional embeddings, it's time to define a class for performing multi-head attention. To start, set up the parameters used in the attention calculation and the linear layers: three for transforming the input embeddings into query, key, and value matrices, and one for projecting the combined attention outputs back into embeddings. A sketch of the constructor follows the instructions below.

torch.nn has been imported as nn.

Instructions

  • Calculate the number of embedding dimensions each attention head will process, head_dim.
  • Define the three input layers (for query, key, and value) and one output layer, disabling the bias parameter on the input layers.
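For reference, here is a minimal sketch of what the constructor could look like. The class name MultiHeadAttention, the argument names d_model and num_heads, and the attribute names (query_linear, key_linear, value_linear, output_linear) are assumptions; the exercise's own scaffold may use different names.

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Assumed arguments: d_model (embedding size) and num_heads
        self.num_heads = num_heads
        # Embedding dimensions each attention head will process
        self.head_dim = d_model // num_heads
        # Input layers mapping embeddings to query, key, and value (no bias)
        self.query_linear = nn.Linear(d_model, d_model, bias=False)
        self.key_linear = nn.Linear(d_model, d_model, bias=False)
        self.value_linear = nn.Linear(d_model, d_model, bias=False)
        # Output layer projecting the concatenated head outputs back to embeddings
        self.output_linear = nn.Linear(d_model, d_model)

Dividing d_model by num_heads assumes the embedding size is evenly divisible by the number of heads, which is the usual convention so that the per-head slices can be concatenated back to the original embedding size.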