
Zero-shot learning with CLIP

You will use zero-shot learning to classify an image from the rajuptvs/ecommerce_products_clip dataset, which contains around 2k images of products along with associated descriptions:

(Image: a woman modeling a dress, a sample product from the dataset)

The dataset (dataset), CLIPProcessor (processor), and CLIPModel (model) have been loaded for you, as well as a list of categories:

categories = ["shirt", "trousers", "shoes", "dress", "hat", 
              "bag", "watch", "glasses", "jacket", "belt"]

This exercise is part of the course

Multi-Modal Models with Hugging Face


Exercise instructions

  • Use the processor to preprocess the categories and the image at index 999 of dataset; enable padding.
  • Pass the unpacked inputs into the model.
  • Calculate the probability of each category using the `.logits_per_image` attribute and the `.softmax()` method.
  • Find the most likely category using probs and categories.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Preprocess the categories and image 
inputs = ____(text=____, images=____, return_tensors="pt", padding=____)

# Process the unpacked inputs with the model
outputs = ____

# Calculate the probabilities of each category
probs = outputs.____.____(dim=1)

# Find the most likely category
category = categories[probs.____.item()]
print(f"Predicted category: {category}")
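For reference, a completed version of the exercise might look like the sketch below. Because `dataset`, `processor`, and `model` are preloaded only inside the exercise environment, this standalone sketch loads the public `openai/clip-vit-base-patch32` checkpoint (an assumption, not necessarily the course's exact model) and substitutes a synthetic solid-color image for the dataset item; in the exercise you would instead pass the image at index 999 (e.g. `dataset[999]["image"]`, assuming an `image` column).

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

categories = ["shirt", "trousers", "shoes", "dress", "hat",
              "bag", "watch", "glasses", "jacket", "belt"]

# In the exercise, `model` and `processor` are preloaded; here we load a
# public CLIP checkpoint explicitly (an assumption, not the course's setup).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for dataset[999]'s image: a plain RGB image so the script runs on its own
image = Image.new("RGB", (224, 224), color=(180, 120, 160))

# Preprocess the categories and the image; padding aligns the tokenized texts
inputs = processor(text=categories, images=image, return_tensors="pt", padding=True)

# Process the unpacked inputs with the model
outputs = model(**inputs)

# logits_per_image has shape (1, len(categories)); softmax turns it into probabilities
probs = outputs.logits_per_image.softmax(dim=1)

# Find the most likely category
category = categories[probs.argmax().item()]
print(f"Predicted category: {category}")
```

The same four steps (preprocess, forward pass, softmax over `logits_per_image`, argmax) are exactly what the blanks in the sample code above ask you to fill in.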