Instance segmentation with Mask R-CNN

1. Instance segmentation with Mask R-CNN

Let's learn about instance segmentation!

2. Faster R-CNN

We've previously covered Faster R-CNN for object detection. Given an image, it predicts the class of each object and the bounding box around it.

3. Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding instance segmentation, retaining a nearly identical architecture with convolutional layers, a Region Proposal Network, and fully connected layers. Mask R-CNN introduces a third model branch that predicts a pixel-to-pixel segmentation mask. This enables accurate instance segmentation.

4. Pre-trained Mask R-CNN in PyTorch

Let's explore using a pre-trained Mask R-CNN model for instance segmentation. We start by importing maskrcnn_resnet50_fpn from torchvision.models.detection. Next, we load the model with pre-trained weights. Then, we load the test image and convert it into a tensor. We will use a photograph of a cat sitting next to a laptop, and we want to detect these two objects. Since the model is pre-trained on the COCO dataset, which includes common objects like animals and computers, it should detect our objects without requiring fine-tuning. Finally, we pass the image tensor to the model to run inference, saving the result in a variable called "prediction".

5. Model outputs

Let's examine the Mask R-CNN outputs. "prediction" is a list of length one, since we passed only one image to the model. This single list element is a dictionary with several keys. The "labels" key contains the class IDs of the recognized objects. These IDs correspond to the COCO dataset classes, which we have stored in the variable class_names. The class names are available on the COCO dataset website. We can see that the top two predicted classes, with indices 17 and 73, correspond to a cat and a laptop, respectively. The "scores" key stores the confidence score of each detection. The cat was detected with a score above 99% (the first value in the tensor) and the laptop with a score above 96%. The remaining values correspond to other, lower-confidence detections. Finally, the "masks" key stores the instance segmentation masks, which we will look at next. The Mask R-CNN prediction also contains bounding boxes, but we are not interested in them when discussing segmentation.
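To make the lookup from class IDs to names concrete, here is a small sketch. The helper `top_detections` is hypothetical, and `class_names` is assumed to be any list or dictionary indexed by COCO class ID. In recent torchvision versions, `MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"]` should provide such a list, but treat that as an assumption to verify against your installed version.

```python
import torch


def top_detections(prediction, class_names, top_k=2):
    """Return (class name, score) pairs for the top_k highest-scoring detections."""
    output = prediction[0]  # one dictionary per input image
    # Sort explicitly by score so the result does not depend on output order.
    order = torch.argsort(output["scores"], descending=True)[:top_k]
    results = []
    for idx in order:
        label = output["labels"][idx].item()
        score = output["scores"][idx].item()
        results.append((class_names[label], round(score, 4)))
    return results


# Usage, given a `prediction` from the model and a `class_names` lookup:
#   for name, score in top_detections(prediction, class_names):
#       print(f"{name}: {score:.2%}")
```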

6. Soft masks

Let's print the unique values of the predicted masks. The values in the segmentation masks produced by Mask R-CNN are not binary (0s and 1s) but are instead floating-point values ranging from 0 to 1. These values represent the model's confidence that each pixel belongs to the object being segmented. These continuous values produce what is known as a "soft mask". Soft masks can provide more nuanced information than binary masks, especially at the boundaries of objects where there might be ambiguity. If we need a binary mask, we can apply a threshold to the soft mask. For example, we might decide that any value above 0.5 should be considered part of the object (set to 1), and any value below 0.5 should be considered background (set to 0).
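Thresholding a soft mask can be sketched in a few lines. The helper name `binarize_masks` is hypothetical, and the tiny tensor below is made-up illustration data, not model output. A value exactly equal to the threshold is treated as background here, which is an arbitrary choice.

```python
import torch


def binarize_masks(soft_masks, threshold=0.5):
    """Turn soft masks (values in [0, 1]) into binary 0/1 masks."""
    # Values strictly above the threshold become 1 (object), the rest 0 (background).
    return (soft_masks > threshold).to(torch.uint8)


# Tiny made-up soft mask, for illustration only.
soft = torch.tensor([[0.02, 0.40], [0.75, 0.98]])
binary = binarize_masks(soft)

print(torch.unique(soft))    # the distinct floating-point values in the soft mask
print(torch.unique(binary))  # after thresholding only 0 and 1 remain
```

The same function works unchanged on the full `prediction[0]["masks"]` tensor, since the comparison is applied element-wise.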

7. Displaying soft masks

Let's see how to display a soft mask overlaid on top of the image. We first extract the masks and labels from the prediction. Next, we iterate over the top two predicted objects. For each, we display the original image and then the mask, setting the color map to "jet" and alpha to 0.5 to make the mask semi-transparent so that it does not obscure the image. Finally, we add class labels to the title and display the result.
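The plotting steps could be sketched as follows. The function name `plot_soft_masks` and the side-by-side layout are illustrative choices rather than the course's exact code. The sketch assumes the prediction contains at least `top_k` detections and that `class_names` can be indexed by class ID.

```python
import matplotlib.pyplot as plt
import torch


def plot_soft_masks(image, prediction: "list[dict[str, torch.Tensor]]",
                    class_names, top_k=2):
    """Overlay the first top_k soft masks on the image, one panel per object."""
    masks = prediction[0]["masks"]    # shape: [N, 1, H, W]
    labels = prediction[0]["labels"]  # shape: [N]
    fig, axes = plt.subplots(1, top_k, figsize=(6 * top_k, 6), squeeze=False)
    for i, ax in enumerate(axes[0]):
        ax.imshow(image)  # original image underneath
        # Semi-transparent soft mask on top; "jet" maps high confidence to red.
        ax.imshow(masks[i, 0].detach().cpu().numpy(), cmap="jet", alpha=0.5)
        ax.set_title(f"Object: {class_names[labels[i].item()]}")
        ax.axis("off")
    return fig


# Usage, given the original image (PIL image or array) and a model prediction:
#   plot_soft_masks(image, prediction, class_names)
#   plt.show()
```

Returning the figure instead of calling `plt.show()` inside the function leaves it to the caller to display or save the plot.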

8. Displaying soft masks

The cat mask is very accurate. The one for the laptop is slightly less so, although the high-confidence red regions are still pretty good.

9. Let's practice!

It's your turn to run instance segmentation!
