Loading 8-bit models
Your company has been using a Llama model for their customer service chatbot for a while now. You've been tasked with figuring out how to reduce the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.
You decide to test whether you can load the model with 8-bit quantization while maintaining reasonable performance.
You are given the model in model_name. AutoModelForCausalLM and AutoTokenizer are already imported for you.
This exercise is part of the course Fine-Tuning with Llama 3.
Exercise instructions
- Import the configuration class to enable loading of models with quantization.
- Instantiate the quantization configuration class.
- Configure the quantization parameters to load the model in 8-bit.
- Pass the quantization configuration to AutoModelForCausalLM to load the quantized model (a sketch of the full solution follows below).
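A minimal sketch of what the finished solution might look like, assuming a recent transformers release with the bitsandbytes and accelerate packages installed; model_name is the model identifier provided in the exercise:

```python
# Import the configuration class that enables quantized loading
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Instantiate the quantization configuration with 8-bit loading enabled
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Pass the configuration so the model weights are loaded in 8-bit;
# device_map="auto" (requires accelerate) places the quantized weights on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Storing the weights as int8 roughly halves memory compared with 16-bit weights, while bitsandbytes dequantizes them on the fly during inference, which is why quality typically degrades only slightly.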