Speeding up inference in quantized models
Your company has been using a quantized Llama model for its customer service chatbot for a while now. One of the biggest customer complaints you receive is that the bot answers questions very slowly and sometimes produces strange answers.
You suspect the odd answers might come from quantizing to 4-bit without normalization. In your investigation, you also suspect that the slow responses come from the inference computations, which use 32-bit floats.
You want to adjust the quantization configuration to improve the inference speed of your model. The following imports have already been loaded: AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig.
This exercise is part of the course Fine-Tuning with Llama 3.
Instructions
- Set the quantization type to normalized 4-bit to reduce the impact of outliers, producing fewer nonsensical answers.
- Set the compute dtype to bfloat16 to speed up inference computations, as sketched below.
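A minimal sketch of what the adjusted configuration might look like, assuming a generic Llama checkpoint; the model ID below is a placeholder for illustration and not necessarily the one preloaded in the exercise:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model ID; swap in the checkpoint your chatbot actually uses.
model_id = "meta-llama/Llama-3.2-1B"

# 4-bit quantization config: NF4 (normalized 4-bit) handles weight outliers better,
# and bfloat16 compute speeds up the matrix multiplications during inference.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # faster inference computations
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick check that the quantized chatbot still responds sensibly.
inputs = tokenizer("How do I reset my password?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Compared with plain 4-bit quantization and float32 compute, switching to NF4 with a bfloat16 compute dtype keeps the memory savings while reducing quantization error from outliers and cutting inference latency.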