
Loading 8-bit models

Your company has been using a Llama model for its customer service chatbot for a while now. You've been tasked with figuring out how to reduce the model's GPU memory usage without significantly affecting performance. This will allow the team to switch to a cheaper compute cluster and save the company a lot of money.

You decide to test whether you can load your model with 8-bit quantization and maintain reasonable performance.

You are given the model in model_name. AutoModelForCausalLM and AutoTokenizer are already imported for you.

This exercise is part of the course

Fine-Tuning with Llama 3

Exercise instructions

  • Import the configuration class to enable loading of models with quantization.
  • Instantiate the quantization configuration class.
  • Configure the quantization parameters to load the model in 8-bit.
  • Pass the quantization configuration to AutoModelForCausalLM to load the quantized model; a sketch of these steps appears after this list.
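
Below is a minimal sketch of how these steps might fit together, assuming model_name holds the checkpoint identifier and that the bitsandbytes library is installed alongside transformers; the configuration class in question is transformers' BitsAndBytesConfig.

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Instantiate the quantization configuration with 8-bit loading enabled
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)

    # Pass the quantization configuration so the weights are loaded in 8-bit
    model = AutoModelForCausalLM.from_pretrained(
        model_name,  # assumed to be provided by the exercise environment
        quantization_config=quantization_config,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading weights in 8-bit roughly halves GPU memory compared with 16-bit weights, which is what makes the switch to a cheaper compute cluster feasible.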
