
Vision Language Models: multi-modal sentiment

Now it's time to integrate your prompt with the Qwen2 Vision Language Model! You'll use the prompt template you created previously, which is available as chat_template.

Let's see what the model thinks about this article! The model (vl_model) and processor (vl_model_processor) have been loaded for you.
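For context, chat_template is a list of chat messages in the Qwen2-VL format, pairing the article image with a text prompt. Below is a minimal sketch of the setup, assuming a standard Qwen2-VL checkpoint; the checkpoint name, image URL, and prompt text are illustrative placeholders, not the course's actual values:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (already done for you in this exercise);
# the checkpoint name here is an assumption
vl_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
vl_model_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# A chat template pairing an image with a sentiment question;
# the URL and prompt text below are illustrative placeholders
chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/article.jpg"},
            {"type": "text", "text": "What is the sentiment of this article?"},
        ],
    }
]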

This exercise is part of the course Multi-Modal Models with Hugging Face.

Exercise instructions

  • Use the processor to preprocess chat_template.
  • Use the model to generate the output IDs, limiting the number of new tokens to 500.
  • Decode the trimmed generated IDs, skipping special tokens.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# chat_template is the prompt template from the previous exercise;
# process_vision_info (from the qwen_vl_utils package) extracts the image inputs
text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(chat_template)

# Use the processor to preprocess the text and image
inputs = ____(
    text=[____],
    images=____,
    padding=True,
    return_tensors="pt",
)

# Use the model to generate the output IDs
generated_ids = ____(**inputs, ____)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

# Decode the generated IDs
output_text = ____(
    generated_ids_trimmed, skip_special_tokens=True
)
print(output_text[0])
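
For reference, here is one way the blanks could be filled in. This is a sketch rather than the official solution; it assumes the standard Qwen2-VL generation workflow, where the processor batches text and images together, generate takes max_new_tokens to cap the output at 500 tokens, and the processor's batch_decode turns token IDs back into text:

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(chat_template)

# Preprocess the text and image together into model-ready tensors
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)

# Generate at most 500 new tokens
generated_ids = vl_model.generate(**inputs, max_new_tokens=500)

# Drop the prompt tokens so only the newly generated tokens remain
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Decode the generated IDs into text, skipping special tokens
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True
)
print(output_text[0])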