
Speeding up inference in quantized models

Your company has been using a quantized Llama model for its customer service chatbot for some time now. One of the biggest customer complaints you receive is that the bot answers questions very slowly and sometimes produces nonsensical answers.

You suspect the odd answers might come from quantizing to 4-bit without normalization. In your investigation, you also suspect that the slow responses come from the inference computations, which are running in 32-bit floats.

You want to adjust the quantization configurations to improve the inference speed of your model. The following imports have already been loaded: AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig.

This exercise is part of the course

Fine-Tuning with Llama 3


Exercise instructions

  • Set the quantization type to normalized 4-bit to reduce the impact of outliers, producing fewer nonsensical answers.
  • Set the compute type to bfloat16 to speed up inference computations (see the sketch after these instructions).
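A minimal sketch of what this configuration might look like, assuming a hypothetical model_id (the exercise environment supplies the actual model, and AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig are already imported):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical model ID for illustration; the exercise uses a Llama model
model_id = "meta-llama/Meta-Llama-3-8B"

# NF4 (normalized 4-bit) quantization reduces the effect of weight outliers,
# while bfloat16 compute replaces the slow 32-bit float inference math.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # normalized 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # faster compute dtype than float32
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
)
```

NF4 assumes the weights are roughly normally distributed and places quantization bins accordingly, which is why it handles outliers better than plain 4-bit; bfloat16 keeps the dynamic range of float32 at half the width, so the dequantize-and-multiply steps at inference run noticeably faster.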
