Multi-modal sentiment analysis

1. Multi-modal sentiment analysis

Let's explore how to perform sentiment analysis with different modalities at the same time.

2. Vision Language Models (VLMs)

Vision Language Models, or VLMs, represent a powerful approach to combining visual and textual understanding. These models process images and text through separate encoders, then fuse these features together to create a unified representation. They employ extensive pretraining, which allows the models to learn shared representations between these modalities, and enables a single model to handle multiple visual reasoning tasks.

3. Visual reasoning tasks

Visual reasoning tasks require AI systems to understand and analyze images. Examples include: Visual Question Answering, or VQA, where we can ask questions about an image, such as "What food is in this photo?"

4. Visual reasoning tasks

Matching tasks, where we check if statements accurately describe what is shown in the image;

5. Visual reasoning tasks

and entailment, where we can determine if an image logically supports the semantics of the text, highlighting potential contradictions between the text and image.

6. Use case: share price impact

We'll use a VLM to extract the sentiment of Ford's share price based on an article from a BBC news dataset. In this example, we extract both the top image, which is the article's header image, and the text content from the article at index 87, which discusses Ford's investment decisions in Mexico.
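As a rough sketch, loading the article could look like the following; the dataset path and the "image" and "content" column names are placeholders, since they depend on the specific BBC news dataset being used.

```python
from datasets import load_dataset

# Hypothetical dataset path and column names -- adjust to the BBC news dataset you're using
dataset = load_dataset("path/to/bbc-news-dataset", split="train")

article = dataset[87]       # the article on Ford's investment decisions in Mexico
image = article["image"]    # the article's header image
text = article["content"]   # the article's text content
```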

7. Qwen 2 VLMs

We'll use a Qwen2 VLM for this. We import the necessary class from transformers and specify the model checkpoint. From the qwen_vl_utils library, we import the process_vision_info() function, which we'll use in a moment.
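A minimal sketch of these imports, assuming the 2B instruction-tuned checkpoint (the course may use a different model size):

```python
from transformers import Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Assumed checkpoint; swap in whichever Qwen2-VL variant you're working with
model_id = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)
```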

8. The preprocessor

We load the Qwen2VLProcessor class for the model, which will process both the images and text, and decode the generated tokens back to natural language. When defining the processor, we specify the minimum and maximum number of pixels in our images, which are hard-coded here but are commonly derived by analyzing the dataset.
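Here's what that could look like; the specific pixel bounds are illustrative assumptions rather than values derived from the dataset.

```python
from transformers import Qwen2VLProcessor

# Hard-coded pixel bounds (assumed values); in practice, derive these from your images
min_pixels = 256 * 28 * 28
max_pixels = 1024 * 28 * 28

processor = Qwen2VLProcessor.from_pretrained(
    model_id,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```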

9. Multi-modal prompts

To combine the text and image inputs of the news article, we need to create a multi-modal prompt for the model. This particular VLM is instruction-tuned, unlike the pre-trained models we've worked with up until now. All this means is that it's been trained to accept and return messages in a conversational format, typically a list of dictionaries, where each message's content is tied to a particular role. This template combines both the article's image and our text query into a single multi-modal prompt.
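A sketch of such a prompt, reusing the image and text extracted earlier; the exact wording of the sentiment question is an assumption.

```python
# One user turn combining the header image and a text query about share-price sentiment
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": (
                    "Based on this article and image, is the sentiment for Ford's "
                    "share price positive, negative, or neutral? Explain why.\n\n" + text
                ),
            },
        ],
    }
]
```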

10. VLM classification

To run inference with our VLM, we apply the chat template to the messages we defined using the processor, turning off tokenization and adding a generation prompt for downstream generation. The process_vision_info() function we imported earlier extracts the images from the chat template. The processor then encodes both the text and images into tensors that the model can understand, adding padding tokens where needed. We then use the model to generate a response limited to 500 new tokens by setting the max_new_tokens flag. We use a list comprehension to cut the IDs related to the input prompt from the generated_ids, so we only get the output.
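Putting those steps together, the generation code could look roughly like this, assuming the messages, processor, and model defined above:

```python
# Render the chat messages to a prompt string, without tokenizing,
# and append the generation prompt for the assistant's reply
text_prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Extract the image (and any video) inputs from the chat messages
image_inputs, video_inputs = process_vision_info(messages)

# Encode both text and images into padded tensors
inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Generate a response of at most 500 new tokens
generated_ids = model.generate(**inputs, max_new_tokens=500)

# Trim the input-prompt IDs so only the newly generated tokens remain
trimmed_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
```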

11. VLM classification

Finally, we use the .batch_decode() method of the processor to decode the generated IDs back to text. The result shows a negative sentiment for Ford's stock price, including an explanation of the sentiment assignment, as we asked for in the prompt.
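Continuing the sketch above, decoding might look like this:

```python
# Decode the generated token IDs back to natural language
output_text = processor.batch_decode(
    trimmed_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])  # e.g. a negative sentiment call plus the model's explanation
```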

12. Let's practice!

Time to try this out!
