VQA with Vision Language Transformers (ViLTs)
Time to have a go with multi-modal generation, starting with Visual Question-Answering (VQA). You will use the dandelin/vilt-b32-finetuned-vqa model to determine the color of the traffic light in the following image:

The preprocessor (processor), model (model), and image (image) have been loaded for you.
This exercise is part of the course
Multi-Modal Models with Hugging Face
Exercise instructions
- Preprocess the text prompt and image.
- Generate the answer tokens from the model and assign them to outputs.
- Find the ID of the answer with the highest confidence using the output logits.
Interactive hands-on exercise
Try to solve this exercise by completing the sample code.
text = "What color is the traffic light?"
# Preprocess the text prompt and image
encoding = ____(____, ____, return_tensors="pt")
# Generate the answer tokens
outputs = ____
# Find the ID of the answer with the highest confidence
idx = outputs.logits.____
print("Predicted answer:", model.config.id2label[idx])
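If you get stuck on the final step, the pattern for decoding a classification head's logits can be seen in isolation. The sketch below uses hypothetical logits and a made-up three-answer label map as stand-ins for the preloaded `model` and its real `id2label` config; only the `argmax`/`item` decoding logic is the point.

```python
import torch
from types import SimpleNamespace

# Hypothetical stand-ins for the exercise's preloaded objects: a fake model
# output with logits over a tiny three-answer vocabulary, and a fake label map
outputs = SimpleNamespace(logits=torch.tensor([[0.2, 3.1, 0.4]]))
id2label = {0: "red", 1: "green", 2: "yellow"}

# argmax over the last dimension picks the highest-confidence answer ID;
# .item() converts the one-element tensor to a plain Python int for the lookup
idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", id2label[idx])  # → Predicted answer: green
```

The same `argmax(-1).item()` chain fills the blank in the exercise, since ViLT's VQA head also returns a `logits` tensor with one score per candidate answer.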