Loading 8-bit models
Your company has been using a Llama model for its customer service chatbot for a while now. You've been tasked with figuring out how to reduce the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.
You decide to test whether you can load the model with 8-bit quantization while maintaining reasonable performance.
The model name is provided in the sample code. AutoModelForCausalLM and AutoTokenizer are already imported for you.
This exercise is part of the course
Fine-Tuning with Llama 3
Exercise instructions
- Import the configuration class to enable loading of models with quantization.
- Instantiate the quantization configuration class.
- Configure the quantization parameters to load the model in 8-bit.
- Pass the quantization configuration to AutoModelForCausalLM to load the quantized model.
Hands-on interactive exercise
Try this exercise by completing the sample code.
# Import quantization configuration class
from ____ import ____

# Instantiate quantization configuration
bnb_config = ____(
    # Set 8-bit loading
    ____=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Maykeye/TinyLLama-v0",
    # Set quantization parameters to load quantized model
    ____=bnb_config,
    low_cpu_mem_usage=True
)
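
For reference, a minimal sketch of one possible completion follows, assuming the Hugging Face transformers library's BitsAndBytesConfig class (8-bit loading also requires the bitsandbytes package and a CUDA-capable GPU):

# Import quantization configuration class
from transformers import BitsAndBytesConfig

# Instantiate quantization configuration
bnb_config = BitsAndBytesConfig(
    # Set 8-bit loading
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Maykeye/TinyLLama-v0",
    # Pass the quantization configuration to load the quantized model
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)

# Report the model's memory usage in megabytes
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.1f} MB")

You can confirm the savings by comparing model.get_memory_footprint() with and without quantization_config; an 8-bit model should occupy roughly a quarter of the memory of the same model loaded with default float32 weights.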