
Speeding up inference in quantized models

Your company has been running a quantized Llama model as its customer service chatbot for a while now. One of the biggest customer complaints you receive is that the bot answers questions very slowly and sometimes produces nonsensical answers.

You suspect the odd answers stem from quantizing to 4-bit without normalization, and that the slowdown comes from the inference computations, which run in 32-bit floats.

You want to adjust the quantization configuration to improve the inference speed of your model. The following imports have already been loaded: AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig.

This exercise is part of the course

Fine-Tuning with Llama 3

Exercise instructions

  • Set the quantization type to normalized 4-bit to reduce outliers, thus producing fewer nonsensical answers.
  • Set the compute dtype to bfloat16 to speed up inference computations (see the sketch after this list).
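
A minimal sketch of one way to wire this up with BitsAndBytesConfig, assuming a placeholder model ID (meta-llama/Meta-Llama-3-8B; the exercise loads its own checkpoint):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Placeholder model ID -- the exercise provides its own Llama model.
    model_id = "meta-llama/Meta-Llama-3-8B"

    # Normalized 4-bit ("nf4") quantization reduces the impact of weight
    # outliers, and computing in bfloat16 instead of 32-bit floats speeds
    # up inference.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quantization_config
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

The nf4 data type assumes normally distributed weights, so extreme values distort the quantization bins less than with plain 4-bit; bfloat16 keeps float32's exponent range at half the width, letting the matrix multiplications during generation run faster.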
