Exercise

VQA with Vision-and-Language Transformers (ViLTs)

Time to have a go with multi-modal generation, starting with Visual Question-Answering (VQA). You will use the dandelin/vilt-b32-finetuned-vqa model to determine the color of the traffic light in the following image:

Picture of a traffic light showing red

The preprocessor (processor), model (model), and image (image) have been loaded for you.

Instructions

  • Preprocess the text prompt and image.
  • Pass the preprocessed inputs to the model and assign the result to outputs.
  • Find the ID of the answer with the highest confidence using the output logits.
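
The three steps above can be sketched as follows. This is a minimal illustration rather than the exercise's official solution: the helper names answer_question and load_vilt are hypothetical, and the question wording in the usage comment is an assumption, since the exercise does not spell out the prompt. Inside the exercise, processor, model, and image are already loaded, so only the body of answer_question is relevant there.

```python
def answer_question(processor, model, image, text):
    """Return the highest-scoring answer label for a question about an image.

    Hypothetical helper: `processor` and `model` are assumed to be a
    ViltProcessor and a ViltForQuestionAnswering instance.
    """
    # Step 1: preprocess the image and text prompt together into PyTorch tensors.
    encoding = processor(image, text, return_tensors="pt")

    # Step 2: run the model; the VQA head scores every candidate answer at once.
    outputs = model(**encoding)

    # Step 3: take the index of the answer with the highest logit.
    idx = outputs.logits.argmax(-1).item()

    # Map the index back to a human-readable answer string.
    return model.config.id2label[idx]


def load_vilt(checkpoint="dandelin/vilt-b32-finetuned-vqa"):
    """Load the processor and model outside the exercise (downloads weights)."""
    # Imported lazily so that answer_question can be used without transformers.
    from transformers import ViltProcessor, ViltForQuestionAnswering

    processor = ViltProcessor.from_pretrained(checkpoint)
    model = ViltForQuestionAnswering.from_pretrained(checkpoint)
    return processor, model


# Example usage (assumed prompt wording; `image` is a PIL image):
# processor, model = load_vilt()
# print(answer_question(processor, model, image, "What color is the traffic light?"))
```

Note that the model does not generate text token by token here: it classifies over a fixed set of candidate answers, which is why the final step is an argmax over the logits followed by a lookup in model.config.id2label.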