Building blocks to train LLMs

1. Building blocks to train LLMs

This video will focus on two popular pre-training techniques to build LLMs: next word prediction and masked language modeling.

2. Where are we?

These pre-training techniques form the foundation for many state-of-the-art language models. Recall that we prioritized discussing fine-tuning over pre-training because many organizations opt to fine-tune existing pre-trained models for their specific tasks rather than building a pre-trained model from scratch.

3. Generative pre-training

LLMs are typically trained using a technique called generative pre-training. This technique involves providing the model with a dataset of text tokens and training it to predict those tokens. Two commonly used styles of generative pre-training are next word prediction and masked language modeling.
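To make this concrete, here is a minimal sketch of the idea in Python. The whitespace "tokenizer" and the single example sentence are simplifying assumptions; real models use subword tokenizers and huge corpora. The training targets are simply the input tokens shifted by one position, so every position learns to predict the token that follows it.

```python
# Minimal sketch of the generative pre-training objective (illustrative only).
# The whitespace "tokenizer" and single sentence are assumptions; real LLMs
# use subword tokenizers and far larger datasets.
text = "The quick brown fox jumps over the lazy dog"
tokens = text.split()

inputs = tokens[:-1]    # the tokens the model reads
targets = tokens[1:]    # the token it should predict at each position

for seen, to_predict in zip(inputs, targets):
    print(f"token {seen!r} -> next token {to_predict!r}")
```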

4. Next word prediction

Let's start with next word prediction. It is a supervised learning technique that trains the model on input data and its corresponding output. Remember, supervised learning uses labeled data to classify or predict new data. Here, the language model is trained to predict the next word in a sentence, given the context of the words before it. The model learns to generate coherent text by capturing the dependencies between words in the larger context. During training, the model is presented with pairs of input and output examples.

5. Training data for next word prediction

For example, from the sentence "The quick brown fox jumps over the lazy dog", we can create input-output pairs for the model to learn from. During training, each output word is appended to the input to form the next pair, so the model always predicts the next word from a growing context. For example, in the first pair the output "fox" corresponds to the input "The quick brown". This output "fox" then gets added to the input of the second pair, which becomes "The quick brown fox", and the model uses this second input to predict "jumps". This is just one example; an LLM is typically trained on a large amount of such text data.
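These pairs can be written out with a few lines of Python. This is only a sketch of the idea; the whitespace tokenization and the three-word starting context are assumptions made for illustration, not how production training data is prepared.

```python
# Sketch of building next-word-prediction pairs from one sentence.
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()

pairs = []
for i in range(3, len(words)):       # start from a three-word context
    context = " ".join(words[:i])    # e.g. "The quick brown"
    next_word = words[i]             # e.g. "fox"
    pairs.append((context, next_word))

for context, next_word in pairs:
    print(f"{context!r} -> {next_word!r}")
# 'The quick brown' -> 'fox'
# 'The quick brown fox' -> 'jumps'
# ...
```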

6. Which word relates more with pizza?

The more examples it sees, the better it predicts the next word. Once trained, we can use the model to generate new sentences one word at a time. For example, if we prompt it with "I love to eat pizza with blank", it is more likely to generate "cheese" than any other word like oregano, coffee, or ketchup. The model has learned from the training data that "cheese" occurs more often with pizza than anything else. Note that the word probabilities here are hypothetical and not based on any specific data.
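Conceptually, the trained model assigns a probability to each candidate next word, and the most likely one is chosen. The sketch below uses made-up probabilities, in the same spirit as the hypothetical numbers mentioned above.

```python
# Hypothetical next-word probabilities for the prompt
# "I love to eat pizza with ___" -- made up for illustration,
# not taken from any real model or dataset.
next_word_probs = {
    "cheese": 0.62,
    "ketchup": 0.18,
    "oregano": 0.15,
    "coffee": 0.05,
}

# Greedy decoding: pick the single most probable next word.
prediction = max(next_word_probs, key=next_word_probs.get)
print(prediction)   # cheese
```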

7. Masked language modeling

The second style of generative pre-training we will learn about is masked language modeling, which involves training a model to predict a masked word that is selectively hidden in a sentence. For instance, if we mask the word "brown" in "The quick brown fox jumps over the lazy dog.", the sentence becomes "The quick [MASK] fox jumps over the lazy dog." During training, the model receives the masked text as input, and the original text provides the label for the hidden word. The model's objective is to correctly predict the missing word between "quick" and "fox". Even though the masked word could be any color, the model learns to predict "brown" based on the training data.
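As a minimal sketch of how such a training example could be constructed, one word is hidden behind a [MASK] token and kept aside as the label. Whitespace tokenization and masking a single whole word are simplifying assumptions; real masked language models mask a fraction of subword tokens chosen at random.

```python
import random

# Sketch of creating one masked-language-modeling example.
sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()

masked_index = random.randrange(len(tokens))   # e.g. the position of "brown"
label = tokens[masked_index]                   # the word the model must recover
masked_tokens = tokens.copy()
masked_tokens[masked_index] = "[MASK]"

print(" ".join(masked_tokens))   # e.g. "The quick [MASK] fox jumps over the lazy dog"
print("target:", label)          # e.g. "brown"
```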

8. Let's practice!

We have learned about powerful techniques that allow language models to learn contextual representations of words. Now, let's practice using these techniques.
