Computer vision

1. Computer vision

Let's explore the possibilities of using Hugging Face for computer vision tasks!

2. Vision models

Vision models extract features from 2D images using architectures designed to account for spatial correlations. These learned features can be used for a variety of tasks: classification, where the whole image receives a class label; object detection, which locates targets within the image in the form of bounding boxes; and segmentation, which labels every pixel according to the segment it belongs to.

3. Classification

Let's try using a model to classify an image from the Flickr30k dataset. We can pass the image directly to an image classification pipeline. The 224 at the end of the model checkpoint name indicates that this model will resize images to 224x224 pixels. Calling the pipeline on the image, we can see it correctly identifies the sport as baseball.
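As a minimal sketch, the classification step might look like the following. The checkpoint name and image path are assumptions for illustration; any image-classification checkpoint ending in 224 resizes its inputs to 224x224 pixels.

```python
from transformers import pipeline
from PIL import Image

# Hypothetical path to the Flickr30k example image
image = Image.open("baseball.jpg")

# Checkpoint name is an assumption; the trailing "224" indicates
# inputs are resized to 224x224 pixels before classification
classifier = pipeline(
    task="image-classification",
    model="google/vit-base-patch16-224",
)

# Returns a list of {"label": ..., "score": ...} dictionaries,
# sorted from most to least confident
predictions = classifier(image)
print(predictions[0])
```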

4. Object detection

Let's consider another example from the Flickr30k dataset showing a martial arts contest. In object detection, the goal is to surround detected objects in an image with a bounding box. Let's find out how to determine the box coordinates and plot them on the image.

5. Object detection

We'll be using Meta's popular ResNet-based object detection model in an object-detection pipeline. When processing an image with the pipeline, the threshold argument filters outputs by confidence score, so in this case only objects the model is at least 95% confident about will be assigned boxes. The model outputs a list of detections, each a dictionary whose keys hold the predicted label, the model's confidence score, and the coordinates of the bounding box in the image. We can see the model found five people with greater than 95% confidence.
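A sketch of this step, assuming the facebook/detr-resnet-50 checkpoint (a widely used Meta object detection model with a ResNet backbone) and a hypothetical image path:

```python
from transformers import pipeline
from PIL import Image

# Hypothetical path to the martial arts image
image = Image.open("martial_arts.jpg")

# Checkpoint name is an assumption: DETR with a ResNet-50 backbone,
# released by Meta, is a common choice for object detection
detector = pipeline(
    task="object-detection",
    model="facebook/detr-resnet-50",
)

# Keep only detections with a confidence score of 0.95 or higher
results = detector(image, threshold=0.95)

# Each detection is a dict with "label", "score", and "box" keys;
# "box" holds the xmin, ymin, xmax, ymax pixel coordinates
for obj in results:
    print(obj["label"], round(obj["score"], 3), obj["box"])
```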

6. Object detection

We can plot the detected objects using the patches module from matplotlib. We first add the image to the plot with .imshow(). Next, we extract the boxes from the model output and use their coordinates to define a rectangle for each detected object. The Rectangle shape comes from patches, which provides classes for drawing various shapes. We color these rectangles, add them to the plot, and display the result. With our chosen threshold, the model has identified people in both the foreground and the background.
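A plotting sketch that continues from the detection results above (variable names carry over from the previous sketch and are assumptions):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)  # show the original image first

# Draw one rectangle per detected object from its box coordinates
for obj in results:
    box = obj["box"]
    rect = patches.Rectangle(
        (box["xmin"], box["ymin"]),      # top-left corner of the box
        box["xmax"] - box["xmin"],       # width
        box["ymax"] - box["ymin"],       # height
        linewidth=2,
        edgecolor="red",
        facecolor="none",
    )
    ax.add_patch(rect)

plt.axis("off")
plt.show()
```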

7. Segmentation

Finally, segmentation models provide pixel-level annotations, meaning the model produces a 2D array with the same height and width as the input. One use case for image segmentation is background removal. Here, each pixel is labeled 1 or 0 depending on whether it is classified as foreground or background, respectively. Multiplying these binary labels with the original image then masks out the pixels identified as background while leaving everything else unchanged. Let's apply background removal to the martial arts image we just saw.
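To make the masking idea concrete, here is a small sketch with a hypothetical binary mask (the mask values and image path are placeholders, not model output):

```python
import numpy as np
from PIL import Image

# Hypothetical inputs: an RGB image and a same-sized binary mask
image = np.array(Image.open("martial_arts.jpg"))
mask = np.zeros(image.shape[:2], dtype=np.uint8)
mask[100:300, 150:400] = 1  # placeholder foreground region

# Broadcasting the 2D mask across the colour channels zeroes out
# background pixels and leaves foreground pixels unchanged
masked = image * mask[:, :, np.newaxis]
```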

8. Segmentation

We'll use the RMBG model here in an image-segmentation pipeline. Because this model requires transformers to download additional custom code, trust_remote_code is set to True. The pipeline's output can be plotted directly with .imshow(), and we can see that most of the audience in the background has been removed.
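A sketch of this step, assuming the briaai/RMBG-1.4 checkpoint and a hypothetical image path (the exact checkpoint used here is an assumption):

```python
from transformers import pipeline
import matplotlib.pyplot as plt

# trust_remote_code=True lets transformers download the custom
# model code that RMBG ships alongside its weights
segmenter = pipeline(
    task="image-segmentation",
    model="briaai/RMBG-1.4",
    trust_remote_code=True,
)

# The pipeline returns the image with the detected background removed
output = segmenter("martial_arts.jpg")

plt.imshow(output)
plt.axis("off")
plt.show()
```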

9. Let's practice!

In the next video, we'll use fine-tuning to tailor models to specific tasks and improve their performance, but for now, let's practice!