
Loading 8-bit models

Your company has been using a Llama model for their customer service chatbot for a while now. You've been tasked with figuring out how to reduce the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.

You decide to test whether you can load the model with 8-bit quantization while maintaining reasonable performance.

You are given the model name in model_name. AutoModelForCausalLM and AutoTokenizer are already imported for you.

This exercise is part of the course

Fine-Tuning with Llama 3


Exercise instructions

  • Import the configuration class that enables loading models with quantization.
  • Instantiate the quantization configuration class.
  • Set the quantization parameter to load the model in 8-bit.
  • Pass the quantization configuration to AutoModelForCausalLM to load the quantized model.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Import the quantization configuration class
from transformers import BitsAndBytesConfig

# Instantiate the quantization configuration
bnb_config = BitsAndBytesConfig(
    # Set 8-bit loading
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Maykeye/TinyLLama-v0",
    # Pass the quantization configuration to load the quantized model
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
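
To confirm that quantization actually reduces memory and that the model still responds sensibly, you can run a quick check after loading. A minimal sketch, assuming bitsandbytes is installed on a CUDA-capable machine and that AutoTokenizer is already imported as stated above (the prompt text is illustrative):

# Report the quantized model's memory footprint in megabytes
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.1f} MB")

# Quick generation sanity check on a sample customer-service prompt
tokenizer = AutoTokenizer.from_pretrained("Maykeye/TinyLLama-v0")
inputs = tokenizer("Hello, how can I help you today?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))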