
Multimodal QA tasks

1. Multimodal QA tasks

Let's now move from multimodal classification tasks to multimodal generation, starting with visual question answering, or VQA.

2. Multimodal QA tasks

VQA models have separate encoders to process the question text and image individually and extract

3. Multimodal QA tasks

modality-specific features. These encoded features are then combined to link related image regions and question words.

4. Multimodal QA tasks

Finally, predictions are generated based on the combined representation. Let's give this a go, starting with processing the images and text.

5. VQA

Using the Python Imaging Library, known as PIL, together with requests, we can load an image from a URL with the Image.open() and requests.get() methods. We can see we have a picture of a wild elephant. The corresponding text we'll use asks the model what animal is present in the photo.
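As a minimal sketch, that setup could look like the following; the URL here is a placeholder for any publicly hosted elephant photo.

```python
from PIL import Image
import requests

# Placeholder URL; any publicly hosted image works here
url = "https://example.com/wild_elephant.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The question we'll ask the model about the image
question = "What animal is in the photo?"
```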

6. VQA

The large image and text datasets used to pretrain image-text models mean that a wide variety of object types are known. We've already seen how models can match encodings of images and text. This means that models can be reused for multiple purposes, often without needing to be retrained or fine-tuned.

7. VQA

We'll use a vision-language transformer model from Dandelin for this task, which is fine-tuned on labeled VQA datasets. We use the processor to encode both our image and question text together. We pass these encoded inputs into the model, unpacking them with double asterisks, to generate logits. We can find the index of the highest-probability answer using .argmax() followed by .item(). Finally, we convert this index into a human-readable answer using the id2label mapping from model.config. The model was successfully able to identify the animal in the image!
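Here is a minimal sketch of that workflow, assuming the dandelin/vilt-b32-finetuned-vqa checkpoint and the image and question loaded above:

```python
from transformers import ViltProcessor, ViltForQuestionAnswering

checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForQuestionAnswering.from_pretrained(checkpoint)

# Encode the image and question together
inputs = processor(image, question, return_tensors="pt")

# Unpack the encoded inputs into the model to generate logits
outputs = model(**inputs)

# Index of the highest-probability answer
idx = outputs.logits.argmax(-1).item()

# Convert the index into a human-readable answer
print(model.config.id2label[idx])
```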

8. Document-text to text

Document VQA builds on VQA with additional steps that extract text from structured elements like charts and tables using optical character recognition, or OCR. Here's an example finance document, which has already been converted into an image for model input.

9. Document-text to text

The Tesseract project, maintained by Google, underpins the majority of OCR tooling, but it isn't installed by default alongside the Hugging Face libraries. We need to install not only the pytesseract Python package, but also the underlying Tesseract engine itself. Installing the engine differs depending on the operating system, so check your system's package installer for more information. Once we have both installed, we're ready to run our pipelines!
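As a quick sanity check, a sketch assuming both pieces are installed, we can confirm that pytesseract can find the Tesseract engine:

```python
# The engine comes from the OS package manager (e.g. apt-get install tesseract-ocr
# or brew install tesseract); the binding comes from pip install pytesseract
import pytesseract

# Raises TesseractNotFoundError if the engine isn't on the PATH
print(pytesseract.get_tesseract_version())
```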

10. Document-text to text

To extract information from the document, we'll use a LayoutLM model fine-tuned on question-answer pairs from the DocVQA dataset. We create a pipeline with the document-question-answering task and the model checkpoint, then pass the pipeline the image and the prompt.
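A minimal sketch of that pipeline follows, assuming the impira/layoutlm-document-qa checkpoint, a placeholder document URL, and an illustrative prompt:

```python
from PIL import Image
import requests
from transformers import pipeline

# Placeholder URL for the finance document rendered as an image
doc_url = "https://example.com/finance_report.png"
doc_image = Image.open(requests.get(doc_url, stream=True).raw)

# Document question-answering pipeline backed by a LayoutLM checkpoint
dqa = pipeline(
    task="document-question-answering",
    model="impira/layoutlm-document-qa",
)

# The pipeline runs OCR on the image and answers the prompt
result = dqa(image=doc_image, question="What is the total gross worth?")
print(result)
```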

11. Document-text to text

The result is a list of dictionaries, each containing an answer and a confidence score. Comparing this to the original document, we can see that the model does a good job of extracting the relevant information.

12. Let's practice!

Let's see how we can use these models for other tasks!