Tokenization and Transformers
1. Tokenization and Transformers
Welcome back!

2. Tokenization
Tokens are the building blocks of LLMs like ChatGPT. Understanding how tokenization works is essential to grasping the inner workings of the Transformer architecture.

3. Tokenization
Consider the following sentence from The Terminator: “I’ll be back.” This sentence contains 5 tokens and 13 characters. Tokenization transforms the words into something computers can understand while preserving two things: first, syntax, the structure of sentences; and second, semantics, the meaning those sentences convey.
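As a quick sanity check, you can count tokens and characters yourself. Here is a minimal sketch using the open-source tiktoken library (an assumption; the lesson doesn't prescribe a specific tokenizer, and exact token counts vary between tokenizers):

```python
# Minimal sketch: comparing token count to character count.
# Assumes tiktoken is installed (pip install tiktoken); counts vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common GPT-era tokenizer
text = "I'll be back."
tokens = enc.encode(text)
print(len(tokens), len(text))  # e.g. 5 tokens vs. 13 characters
```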
4. Encoding

When you input text into a language model, the first step is to convert it into tokens. Tokens are numerical representations of pieces of text. Each token is assigned a unique identifier called a token ID. In our example, “I’ll be back.” is split into 5 tokens, each with a corresponding ID.
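To see encoding in action, here is a hedged sketch, again with tiktoken; the specific IDs you get depend entirely on which tokenizer the model uses:

```python
# Sketch of encoding text into token IDs with tiktoken (assumed installed).
# The IDs below are tokenizer-specific, not universal.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("I'll be back.")
print(ids)  # a list of integer token IDs
for token_id in ids:
    print(token_id, repr(enc.decode([token_id])))  # each ID maps back to a piece of text
```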
5. Vectorization

Once we have token IDs, the next step is to convert these IDs into vectors, which are numerical arrays that capture the semantic meaning of the tokens. This process is called embedding.
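Conceptually, embedding is a table lookup: each token ID indexes a row of a large matrix of vectors. A toy sketch with NumPy follows; the table here is randomly initialized and the IDs are made up for illustration, whereas in a real model the table is learned during training:

```python
# Toy embedding lookup: token IDs index rows of an embedding matrix.
# Random values stand in for learned embeddings; IDs are illustrative.
import numpy as np

vocab_size, embed_dim = 50_000, 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

token_ids = [40, 3358, 387, 1203, 13]  # hypothetical IDs for our 5 tokens
vectors = embedding_table[token_ids]   # one vector per token
print(vectors.shape)                   # (5, 4)
```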
6. Decoding

After processing and generating a response, the model needs to convert the numerical data back into human-readable text. This is where decoding comes into play. Decoding translates the generated sequence of token IDs back into words and sentences.
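Decoding is simply the inverse of encoding, as this short sketch shows (assuming the IDs came from the same tokenizer that will decode them):

```python
# Round trip: text -> token IDs -> text. Sketch with tiktoken (assumed installed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("I'll be back.")
print(enc.decode(ids))  # "I'll be back."
```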
7. Self-attention

The Transformer uses self-attention, sometimes referred to as intra-attention, to compute a representation of its input and output without using sequence-aligned RNNs. Self-attention allows the model to weigh the importance of each token in the input sequence relative to all other tokens. This mechanism enables the model to capture relationships and dependencies between tokens, regardless of their distance from each other in the sequence.
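Here is a minimal sketch of scaled dot-product self-attention in NumPy. The projection matrices Wq, Wk, and Wv are random stand-ins for what a real model learns, and the shapes are made up for illustration:

```python
# Minimal scaled dot-product self-attention (single head, no masking).
# Random weights stand in for learned parameters.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # project tokens to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, 8-dim embeddings (illustrative)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): one contextualized vector per token
```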
8. Self-attention

For example, in the sentence "The cat sat on the mat because it was tired," the word "it" refers to "the cat." Self-attention helps the model recognize that "it" is linked to "the cat," ensuring the context is preserved when generating text or translating languages.
9. Token relationships

In the sentence "The cat sat on the mat because it was tired," the model compares each token with every other token in the sequence. For instance:

- "it" vs. "the cat": The model evaluates the relevance of "it" to "the cat" and determines that "it" refers to "the cat."
- "it" vs. "sat": The model assesses the relationship between "it" and "sat," but finds a lower relevance because "it" is more likely to refer to a noun (in this case, "the cat") than to a verb.
- "tired" vs. "the cat": The model considers how "tired" relates to "the cat" and establishes that "tired" is a state attributed to "the cat."

Through these comparisons, the model assigns attention scores that quantify the importance of each token in relation to the others, as the sketch after this list illustrates.
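The following toy example turns hand-picked raw scores for the token "it" into attention weights with a softmax. The scores are invented purely for illustration; a real model derives them from learned query and key vectors:

```python
# Illustrative only: made-up raw scores for "it" against each token in the sentence,
# converted to attention weights via softmax.
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "tired"]
raw_scores = np.array([0.1, 4.0, 0.5, 0.1, 0.1, 0.3, 0.2, 1.0, 0.2, 2.0])  # invented
weights = np.exp(raw_scores) / np.exp(raw_scores).sum()
for tok, w in zip(tokens, weights):
    print(f"{tok:>8}: {w:.3f}")  # "cat" receives the largest weight for "it"
```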
10. Contextual understanding

With the attention scores calculated, the model focuses on the most relevant tokens to build contextual understanding. Because the attention score between "it" and "the cat" is high, the model understands that "it" is referring to "the cat," not any other noun.
11. Let's practice!

Now that you’ve learned about the Transformer architecture and the fundamental concepts of tokenization, encoding, vectorization, decoding, and self-attention, it’s time to put your knowledge into practice.