
Loading 8-bit models

Your company has been using a Llama model for its customer service chatbot for a while now. You've been tasked with reducing the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.

You decide to test whether you can load the model with 8-bit quantization while maintaining reasonable performance.

The model name is provided in model_name. AutoModelForCausalLM and AutoTokenizer are already imported for you.

This exercise is part of the course

Fine-Tuning with Llama 3

Instructions

  • Import the configuration class to enable loading of models with quantization.
  • Instantiate the quantization configuration class.
  • Configure the quantization parameters to load the model in 8-bit.
  • Pass quantization configuration to AutoModelForCausalLM to load the quantized model.
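Putting the four steps together, here is a minimal sketch. It assumes the transformers and bitsandbytes libraries are installed; since model_name is supplied by the exercise environment, the loading step is wrapped in a function that takes it as an argument.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Instantiate the quantization configuration and enable 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)


def load_quantized(model_name):
    """Pass the quantization configuration to AutoModelForCausalLM.

    Note: actually loading in 8-bit requires a CUDA GPU and the
    bitsandbytes library at runtime; model_name comes from the exercise.
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
```

Loading in 8-bit stores the linear-layer weights as int8 instead of 16- or 32-bit floats, roughly halving GPU memory compared to fp16 at a small cost in accuracy, which is why it suits the cheaper-cluster goal here.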
