
Speeding up inference in quantized models

Your company has been using a quantized Llama model for its customer service chatbot for a while now. One of the biggest complaints you receive from customers is that the bot answers questions very slowly and sometimes produces nonsensical answers.

You suspect the strange answers might stem from quantizing to 4-bit without normalization. In your investigation, you also suspect that the slowdown comes from the inference computations, which are running in 32-bit floats.

You want to adjust the quantization configuration to improve your model's inference speed. The following imports have already been loaded: AutoModelForCausalLM, AutoTokenizer, and BitsAndBytesConfig.

This exercise is part of the course

Fine-Tuning with Llama 3


Exercise instructions

  • Set the quantization type to normalized 4-bit to reduce outliers, thus producing fewer nonsensical answers.
  • Set the compute type to bfloat16 to speed up inference computations (a sketch follows this list).
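
As a minimal sketch of what this configuration could look like, assuming the bitsandbytes backend is installed; the model ID below is a placeholder for whichever Llama checkpoint your chatbot uses:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 (normalized 4-bit) reduces the impact of outlier weights,
# and bfloat16 compute speeds up inference versus 32-bit floats.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # normalized 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # faster inference computations
)

model_name = "meta-llama/Meta-Llama-3-8B"    # placeholder model ID
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

The weights stay quantized in 4-bit for memory savings, while matrix multiplications during generation are carried out in bfloat16 rather than float32, which is where the speedup comes from.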
