
Image-text similarity

1. Image-text similarity

Welcome back! Let's explore how AI can understand the relationship between images and text.

2. CLIP

To investigate this relationship, we'll be using CLIP, a Contrastive Language-Image Pre-training model developed by OpenAI, which is designed to score the similarity between images and text. It was trained on 400 million image-text pairs. When given an image and some text, CLIP's image encoder converts the picture into a single array, and a text encoder does the same with the words. During training, the two encoders learn to produce similar arrays when the image matches the words.
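As a rough sketch of this idea, the snippet below encodes an image and a caption with CLIP's two encoders via the Hugging Face transformers library and compares the resulting arrays with cosine similarity. The image file and caption are placeholders for illustration; any publicly released CLIP checkpoint would work.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Load a publicly released CLIP checkpoint and its matching processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical local image and caption, for illustration only
image = Image.open("shirt.jpg")
text = "a printed long-sleeved shirt"

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Each encoder turns its input into a single embedding vector
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Matching image-text pairs are trained to end up close together
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity.item())
```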

3. Zero-shot learning

We'll see how CLIP can be used for zero-shot classification. Zero-shot learning is a powerful paradigm where AI models can handle tasks they weren't explicitly trained for. Unlike traditional machine learning models that require labeled examples of every category, zero-shot models can make educated guesses about new concepts based on their understanding of related concepts and descriptions.

4. Use case: product categorization

Let's look at a practical example using an e-commerce dataset. Here we're loading a collection of product images and descriptions using the Hugging Face datasets library. In this example, we have a clothing item - specifically a printed long-sleeved shirt. CLIP can help us quantify how well the features extracted from the image align with its description, which is crucial for tasks like automatic categorization, search, and recommendation systems in e-commerce platforms.
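A minimal sketch of this loading step is shown below; the dataset identifier and column names are placeholders rather than the actual course dataset.

```python
from datasets import load_dataset

# Dataset name and field names are placeholders; substitute the
# e-commerce dataset used in this example
dataset = load_dataset("your-org/ecommerce-products", split="train")

sample = dataset[0]
image = sample["image"]        # PIL image of the product
description = sample["text"]   # its description, e.g. "printed long-sleeved shirt"
print(description)
```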

5. Zero-shot learning with CLIP

First, we load the pre-trained CLIP model and its processor. Next, we define a list of possible product categories - shirts, trousers, shoes, and so on - for the zero-shot classification. In the final step, we use the processor to prepare our example product image and the list of possible categories, then feed them both through the model. CLIP will compare how well the image matches each category, giving predictions without needing to see a single training example from our dataset. The .logits_per_image attribute returns the similarity between the image and each category. Applying the .softmax() method converts these similarities into probabilities. Calling .argmax() followed by .item() gives the index of the category with the highest probability; in this case, the shirt.
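Put together, a sketch of this workflow with the Hugging Face transformers library could look like the following; the checkpoint name and category list are illustrative assumptions, and `image` is the product image loaded earlier.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the pre-trained CLIP model and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate product categories (an illustrative list)
categories = ["shirt", "trousers", "shoes", "dress", "hat"]

# Prepare the example product image and the candidate categories together
inputs = processor(text=categories, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per category for our single image
probs = outputs.logits_per_image.softmax(dim=1)
predicted = categories[probs.argmax().item()]
print(predicted)  # e.g. "shirt"
```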

6. The CLIP score

The CLIP score measures how close CLIP's text encoding is to its image encoding. It is 100 for perfect agreement and 0 for no agreement. We load the example image and its description from the dataset, convert the image into a tensor, and scale it by 255 so the pixel color values return to their original range, as the clip_score() function applies its own scaling. The function then takes the image, the description, and the model. A score of 28 is reasonable and would likely be higher if the description were a little more concise.
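A possible sketch of this computation, assuming the torchmetrics implementation of clip_score() and the `image` and `description` loaded from the dataset above:

```python
from torchvision import transforms
from torchmetrics.functional.multimodal import clip_score

# ToTensor() scales pixels to [0, 1]; multiplying by 255 restores the
# original value range, since clip_score() applies its own scaling
image_tensor = transforms.ToTensor()(image) * 255

score = clip_score(image_tensor, description,
                   model_name_or_path="openai/clip-vit-base-patch32")
print(round(score.item(), 2))  # e.g. roughly 28 for this example
```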

7. Let's practice!

Let's practice zero-shot learning and assess the quality of image descriptions.