
Making models smaller with quantization

1. Making models smaller with quantization

Welcome! In this video we will learn how to make models smaller using quantization so they can be loaded and fine-tuned on lighter-weight hardware.

2. What is quantization?

Model quantization reduces memory usage and can speed up inference by converting a model to a lower-precision format. Parameters and activations are stored with fewer bits, for example moving from 32-bit floats to 8-bit or 4-bit integers, which reduces memory usage. We can also adjust the training process when applying quantization so that quantizing the model has minimal impact on performance.
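As a rough back-of-the-envelope check (our own illustrative numbers, counting weights only and ignoring activations, optimizer state, and framework overhead), here is how the bit width translates into weight memory for a model in the 8-billion-parameter range:

```python
# Approximate memory needed just to store the weights of an
# 8-billion-parameter model at different precisions.
num_params = 8e9

for bits in (32, 8, 4):
    gigabytes = num_params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{bits:>2}-bit weights: ~{gigabytes:.0f} GB")

# Output:
# 32-bit weights: ~32 GB
#  8-bit weights: ~8 GB
#  4-bit weights: ~4 GB
```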

3. Types of quantization

There are many types of quantization. Weight quantization reduces the precision of the model weights. Activation quantization reduces the precision of the activation values during inference. Post-Training Quantization, which is what we will focus on in this course, quantizes a model after training to save space.

4. Configuring quantization with bitsandbytes

To use quantization, we install the bitsandbytes library via pip. Within our Python training code, we import BitsAndBytesConfig from transformers. We instantiate the quantization configuration with BitsAndBytesConfig, choosing the desired precision (4-bit or 8-bit) and the quantization type: regular 4-bit floating point, or normalized float, a form of quantization that is more resistant to outlier values. By setting the data type used for computations with the quantized model, we can control how precise those computations are. It defaults to 32-bit floats, but we can optionally set it to bfloat16 for increased efficiency.
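A minimal sketch of such a configuration, using the standard BitsAndBytesConfig options from transformers (the variable name is ours), might look like this:

```python
import torch
from transformers import BitsAndBytesConfig

# Quantization settings: load weights in 4-bit, use the normalized float
# ("nf4") quantization type, and run computations in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit precision (or load_in_8bit=True for 8-bit)
    bnb_4bit_quant_type="nf4",              # "fp4" = regular float, "nf4" = normalized, outlier-resistant
    bnb_4bit_compute_dtype=torch.bfloat16,  # defaults to float32; bfloat16 is more efficient
)
```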

5. Loading model with quantization

This is the full configuration. To load the quantized model with these settings, we use AutoModelForCausalLM to load the larger Llama 3 8-billion-parameter model, passing our bitsandbytes configuration to the quantization_config parameter. The model weights are then loaded in a lower-bit representation, reducing memory use with a minor trade-off in output quality and somewhat longer inference times.
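A sketch of this step, reusing the bnb_config from above, might look like the following; the exact model ID is an assumption for illustration, and device_map="auto" is an optional convenience for placing the quantized weights on the available GPU:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM on the Hub is loaded the same way.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the BitsAndBytesConfig from the previous step
    device_map="auto",               # place the quantized weights on the available device(s)
)
```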

6. Using a quantized model

We can use a quantized model the same way as a regular model. Using the llama3-chat model we just loaded, we can pass it a prompt asking about the history of Mars. We tokenize the text with the tokenizer's encode method, passing our prompt string, then generate output tokens with the model and our tokenized inputs, limiting the number of new tokens to 200. We then decode only the generated part of the output by skipping past the input tokens, and print the result. It looks like a reasonable summary of the red planet.
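A sketch of this workflow, with a hypothetical prompt string and a 200-token generation limit, might look like this:

```python
# Hypothetical prompt, following the slide's example about Mars.
prompt = "Tell me about the history of Mars."

# Tokenize the prompt and move the input IDs to the model's device.
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

# Generate up to 200 new tokens.
output_ids = model.generate(input_ids, max_new_tokens=200)

# Decode only the newly generated tokens (everything after the input tokens).
generated = tokenizer.decode(
    output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
)
print(generated)
```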

7. Finetuning a quantized model

We can't directly fine-tune a quantized model because its weights are discretized. To get around this, we can use something we have already learned: LoRA adaptation. We pass the quantized model that we loaded with BitsAndBytesConfig to the SFTTrainer, along with an instance of the LoRA configuration from the peft library as the peft_config argument. Then we train the model as usual, with LoRA adaptation, by calling trainer.train().
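A minimal sketch of this setup, with an illustrative dataset and illustrative LoRA hyperparameters (rank, alpha, dropout), might look like the following; depending on your trl version you may also need to specify the dataset's text field or pass an SFTConfig:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

# Illustrative dataset with a plain "text" column.
dataset = load_dataset("imdb", split="train[:1%]")

# LoRA settings; the values here are illustrative, not prescribed by the course.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# SFTTrainer attaches trainable LoRA adapters on top of the frozen,
# quantized base model, so only the adapter weights are updated.
trainer = SFTTrainer(
    model=model,              # the quantized model loaded earlier
    train_dataset=dataset,
    peft_config=lora_config,
)

trainer.train()
```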

8. Let's practice!

Let's practice quantization!
