Loading 8-bit models
Your company has been using a Llama model for its customer service chatbot for a while now. You've been tasked with reducing the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.
You decide to test whether you can load the model with 8-bit quantization while maintaining reasonable performance.
You are given the model in model_name. AutoModelForCausalLM and AutoTokenizer are already imported for you.
This exercise is part of the course Fine-Tuning with Llama 3.
Exercise instructions
- Import the configuration class to enable loading of models with quantization.
- Instantiate the quantization configuration class.
- Configure the quantization parameters to load the model in 8-bit.
- Pass the quantization configuration to AutoModelForCausalLM to load the quantized model (see the sketch after this list).
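
For reference, a minimal sketch of one possible solution follows. It assumes model_name is already defined in the exercise environment and that the bitsandbytes library is installed, since Transformers delegates 8-bit loading to it; the exact checkpoint and device placement may differ in your setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Configure quantization so the model weights are loaded in 8-bit
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Pass the configuration to load the quantized model;
# device_map="auto" lets Accelerate place the 8-bit weights on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

# The tokenizer is unaffected by quantization
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

With load_in_8bit=True, the linear-layer weights are stored in 8 bits instead of 16 or 32, roughly halving memory relative to fp16 and quartering it relative to fp32, which is what makes the cheaper compute cluster viable.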
