Speeding up inference in quantized models

Your company has been running a quantized Llama model as its customer service chatbot for a while now. One of the biggest customer complaints you receive is that the bot answers questions very slowly and sometimes produces nonsensical answers.

You suspect the odd answers might come from quantizing to 4-bit without normalization. In your investigation, you also suspect that the slow responses come from the inference computations, which are still performed in 32-bit floats.

You want to adjust the quantization configuration to improve your model's inference speed. The following imports have already been loaded: AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig.

This exercise is part of the course Fine-Tuning with Llama 3.

Exercise instructions

  • Set the quantization type to normalized 4-bit (NF4) to reduce outliers and produce fewer nonsensical answers.
  • Set the compute dtype to bfloat16 to speed up inference computations (a configuration sketch follows below).
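
Below is a minimal sketch of how these two settings could be passed to BitsAndBytesConfig when loading the model. The checkpoint name meta-llama/Meta-Llama-3-8B and the device_map argument are assumptions for illustration, not part of the exercise setup; the exercise environment already provides the imports.

    # Minimal sketch: NF4 quantization with bfloat16 compute
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "meta-llama/Meta-Llama-3-8B"  # assumed placeholder checkpoint

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # normalized 4-bit reduces outliers
        bnb_4bit_compute_dtype=torch.bfloat16,  # faster inference than 32-bit floats
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",  # assumption: let accelerate place the layers
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

With this configuration, weights are stored in normalized 4-bit form while matrix multiplications during generation run in bfloat16, which addresses both the quality and the speed complaints described above.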
