1. Tuning Llama 3 parameters
Let's continue by exploring how to tune Llama's responses.
2. What are parameters for?
Previously, we generated basic responses with Llama 3. Now, how do we control their quality, randomness, and length?
3. What are parameters for?
Let's say we are generating product descriptions for different e-commerce sites. Some need to be factual and concise, while others should be engaging and creative.
4. Llama 3 decoding parameters
We can adjust Llama's behavior using decoding parameters to match different tones. These parameters 'decode', or transform, raw model output into readable text, allowing us to modify responses while keeping the core content.
5. Llama 3 decoding parameters
We'll explore four key parameters.
The temperature controls randomness.
Top-k limits token selection to the most probable choices.
Top-p adjusts token selection based on cumulative probability.
Max tokens limits response length.
6. Temperature
Let's see the parameters in action.
Temperature values are usually between 0 and 1.
If we ask to write a product description with a temperature closer to zero, the output produced will be more predictable. In this example describing a smartwatch, it highlights essential features like a heart rate monitor, GPS, and long battery life.
With a high temperature, closer to one, the model will produce more creative responses.
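The effect of temperature can be sketched in plain Python: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution and high values flatten it. The logits below are made-up numbers for illustration, not real Llama 3 output.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next words
logits = [4.0, 2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharp: the top word dominates
high = softmax_with_temperature(logits, 1.0)  # flatter: more variety

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With temperature 0.2 almost all probability lands on the first word, which is why low-temperature descriptions come out predictable; at 1.0 the other words keep a real chance of being picked.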
7. Top-k
Top-k controls how many words the model considers each time it adds a new word to the response.
With a low top-k, such as 1, the model picks only the most probable word, resulting in a predictable but potentially repetitive response. For example, with the smartwatch description, this might result in a straightforward list of features, similar to what happened with low temperature.
With a higher top-k, such as 50, the model has more words to choose from, leading to a more expressive response.
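Under the hood, top-k simply keeps the k most probable candidates and renormalizes their probabilities before sampling. A minimal sketch with hypothetical next-word probabilities (not real model output):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable words and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

# Hypothetical probabilities for the next word in a smartwatch description
probs = {"tracks": 0.5, "monitors": 0.3, "measures": 0.15, "counts": 0.05}

print(top_k_filter(probs, 1))  # only the single most likely word survives
print(top_k_filter(probs, 3))  # more words remain to choose from
```

With k=1 the filtered distribution collapses to one word, matching the predictable behavior described above.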
8. Top-p
Top-p is similar to top-k, but instead of a fixed number of words, it limits token selection by cumulative probability: the model keeps the smallest set of words whose probabilities add up to the chosen threshold.
A high top-p allows for more varied responses, listing multiple smartwatch features.
A low top-p keeps the output focused and precise, only mentioning core functionalities.
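This cumulative cutoff, often called nucleus sampling, can be sketched the same way; again the probabilities below are made-up illustrative values.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

probs = {"tracks": 0.5, "monitors": 0.3, "measures": 0.15, "counts": 0.05}

print(top_p_filter(probs, 0.4))  # low top-p: only the top word survives
print(top_p_filter(probs, 0.9))  # high top-p: several candidates survive
```

Note the difference from top-k: if one word is overwhelmingly likely, a low top-p keeps only that word, while top-k would always keep exactly k words regardless of how the probability is spread.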
9. Max tokens
To limit response length, we use max_tokens.
This sets a direct limit on how many tokens, the word-like units the model generates, the response may contain. If we need a short and precise summary, we can limit the response to just a few tokens, while if we need a more detailed explanation, we increase max_tokens.
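Conceptually, the model emits one token at a time and generation simply stops once the cap is reached. A toy sketch of that loop, with a stand-in "model" instead of Llama 3:

```python
def generate(next_token_fn, max_tokens):
    """Append tokens one at a time, stopping once max_tokens is reached."""
    tokens = []
    while len(tokens) < max_tokens:
        token = next_token_fn(tokens)
        if token is None:  # the model may also finish early on its own
            break
        tokens.append(token)
    return tokens

# Toy "model" that would happily keep producing words forever
words = iter(["A", "sleek", "smartwatch", "with", "GPS", "and", "more"])
result = generate(lambda toks: next(words, None), max_tokens=4)
print(result)  # generation is cut off after four tokens
```

Because the cut is a hard stop, a too-small max_tokens can truncate a sentence mid-thought, so it is worth leaving a little headroom.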
10. Combining different parameters
Let's combine them all in this example where we want to describe an electric car: setting temperature to 0.2, top-k to 1, top-p to 0.4, and max tokens to 20 will result in a short, concise product description.
11. Combining different parameters
We'll get a much more creative output with a temperature of 0.8, top-k of 10, top-p of 0.9, and max tokens of 100.
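The four parameters can be pictured as one pipeline applied at every decoding step: scale logits by temperature, keep the top-k candidates, apply the top-p cutoff, then sample, repeating until max tokens is hit. Below is a minimal sketch of a single step with made-up logits; real Llama 3 inference libraries implement this far more efficiently, but the order of operations is the idea to take away.

```python
import math
import random

def sample_step(logits, temperature, top_k, top_p):
    """One decoding step: temperature -> softmax -> top-k -> top-p -> sample."""
    scaled = {w: l / temperature for w, l in logits.items()}
    m = max(scaled.values())
    exps = {w: math.exp(s - m) for w, s in scaled.items()}  # stable softmax
    total = sum(exps.values())
    probs = {w: e / total for w, e in exps.items()}

    # Top-k: keep only the k most probable candidates
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: cut once cumulative probability reaches the threshold
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    words, weights = zip(*kept)
    return random.choices(words, weights=weights)[0]

# Hypothetical logits for the next word of an electric-car description
logits = {"efficient": 3.0, "sleek": 2.0, "futuristic": 1.0, "quirky": 0.5}

# Conservative settings: with top_k=1 the choice is deterministic
print(sample_step(logits, temperature=0.2, top_k=1, top_p=0.4))
```

With the creative settings (higher temperature, larger top-k and top-p), more of the candidate words stay in play at each step, which is what produces the more varied output described above.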
12. Let's practice!
Let's tune Llama 3 parameters!