1. Semantic segmentation with U-Net
Great work so far! It's time to learn about semantic segmentation!
2. Semantic segmentation
In semantic segmentation, we assign a class label to every pixel in the image, without distinguishing between instances of the same class.
This approach to segmentation is useful for medical imaging or satellite image analysis, among other applications.
A common neural network architecture for semantic segmentation tasks is the U-Net.
3. U-Net architecture
The U-Net, initially designed for biomedical image segmentation, is named after its architecture's U shape. It consists of two parts: an encoder and a decoder.
The encoder, shown in blue, captures the image context through a series of convolutional and pooling layers, reducing the height and width of the feature maps while increasing their depth. This process is referred to as downsampling.
4. U-Net architecture
The decoder, shown in orange, mirrors the encoder. It gradually upsamples the feature maps using transposed convolutional layers, which we will discuss shortly, making the feature maps higher and wider but shallower. This produces an output with the same spatial dimensions as the input, allowing us to predict the class for each pixel in the form of a mask.
5. U-Net architecture
The U-Net uses skip connections, shown as gray horizontal arrows in the diagram. Skip connections are direct links from the encoder to the decoder, ensuring the preservation of details lost during downsampling.
Notice how the first encoder block is linked to the last decoder block, the second encoder block to the penultimate decoder block, and so on.
The input to each decoder block consists of the concatenated outputs of the previous decoder block and the corresponding encoder block.
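As a minimal sketch of this concatenation, assuming hypothetical tensor shapes, torch.cat joins the upsampled decoder output and the matching encoder output along the channel dimension:

    import torch

    # Hypothetical feature maps: one from the previous decoder block
    # (after upsampling) and one from the matching encoder block
    decoder_out = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)
    encoder_out = torch.randn(1, 64, 32, 32)

    # Concatenate along the channel dimension (dim=1): depth doubles
    block_input = torch.cat([decoder_out, encoder_out], dim=1)
    print(block_input.shape)  # torch.Size([1, 128, 32, 32])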
6. Transposed convolution
Transposed convolution is used to upsample feature maps in the decoder part of the U-Net. It increases their height and width while reducing their depth.
The process involves inserting zeros between or around the feature map input pixels. Here, the blue two-by-two feature map is padded with white zeros.
Next, a regular convolution operation is performed on the zero-padded input, resulting in an upsampled feature map with enlarged spatial dimensions.
7. Transposed convolution in PyTorch
The transposed convolutional layer is available from torch.nn as ConvTranspose2d. It accepts parameters similar to regular convolutional layers, including input and output channel numbers, kernel size, and stride. In our U-Net architecture, we'll set both the kernel size and stride to two.
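As a minimal sketch, assuming illustrative channel numbers, a transposed convolution with kernel size and stride of two doubles the height and width of its input:

    import torch
    import torch.nn as nn

    # Kernel size and stride of two, as in our U-Net decoder;
    # the channel numbers (64 in, 32 out) are illustrative
    upconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                                kernel_size=2, stride=2)

    x = torch.randn(1, 64, 16, 16)  # (batch, channels, height, width)
    out = upconv(x)
    print(out.shape)  # torch.Size([1, 32, 32, 32])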
8. U-Net: layer definitions
Let's build a U-Net! We start with the init method where we define the model's layers and layer blocks.
First, the encoder layers consist of four convolutional blocks with ReLU activations, defined using the custom helper function conv_block. This function will be provided for you in the exercises. We assign the encoder blocks to enc1, enc2, and so on.
Notice how each subsequent encoder block increases the feature maps' depth. We also define a max pooling layer here.
For the decoder, we define three transposed convolutions that decrease the feature maps' depth, assigning them to upconv3, upconv2, and upconv1, respectively.
Finally, we define the convolutional blocks to be used in the decoder.
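Putting these definitions together, a sketch of the init method might look as follows. The channel sizes, the two-convolution implementation of conv_block, and the final one-by-one output convolution are assumptions for illustration; the exercises provide the actual conv_block.

    import torch
    import torch.nn as nn

    def conv_block(in_channels, out_channels):
        # Assumed implementation: two 3x3 convolutions, each
        # followed by a ReLU activation (provided in the exercises)
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    class UNet(nn.Module):
        def __init__(self, in_channels=3, num_classes=2):
            super().__init__()
            # Encoder: four convolutional blocks with increasing depth
            self.enc1 = conv_block(in_channels, 64)
            self.enc2 = conv_block(64, 128)
            self.enc3 = conv_block(128, 256)
            self.enc4 = conv_block(256, 512)
            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

            # Decoder: transposed convolutions decreasing depth
            self.upconv3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
            self.upconv2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
            self.upconv1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

            # Decoder convolutional blocks; their input depth is doubled
            # by the skip-connection concatenation
            self.dec3 = conv_block(512, 256)
            self.dec2 = conv_block(256, 128)
            self.dec1 = conv_block(128, 64)

            # Assumed 1x1 convolution mapping to the class channels
            self.out = nn.Conv2d(64, num_classes, kernel_size=1)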
9. U-Net: forward method
Now, we can use the layers we have defined in the init method to construct the forward method. It receives the input x.
First, the input is passed through the encoder's convolutional blocks, applying the pooling layer before each block after the first.
Next, we implement the decoder part and the skip connections. Since we have defined three upsampling layers, the decoder will consist of three steps.
First, we pass the encoded input to the transposed convolution. Next, we concatenate it with the corresponding encoder output using torch.cat. Finally, we pass the result through a decoder convolution block.
We repeat this sequence for the remaining two decoder steps.
Finally, we return the output of the last decoder step.
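A corresponding sketch of the forward method, under the same assumptions as the layer definitions above, including the assumed one-by-one output convolution that produces the class masks:

    def forward(self, x):
        # Encoder: pooling is applied before each block after the first
        x1 = self.enc1(x)
        x2 = self.enc2(self.pool(x1))
        x3 = self.enc3(self.pool(x2))
        x4 = self.enc4(self.pool(x3))

        # Decoder step 3: upsample, concatenate with the corresponding
        # encoder output, then apply the decoder convolution block
        x = self.upconv3(x4)
        x = torch.cat([x, x3], dim=1)
        x = self.dec3(x)

        # Decoder step 2
        x = self.upconv2(x)
        x = torch.cat([x, x2], dim=1)
        x = self.dec2(x)

        # Decoder step 1
        x = self.upconv1(x)
        x = torch.cat([x, x1], dim=1)
        x = self.dec1(x)

        return self.out(x)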
10. Running inference
To wrap up, let's use a trained U-Net to produce segmentation masks.
We load the model and put it in evaluation mode. Next, we load this car image and convert it to a tensor. Then, we pass it to the model for inference. The prediction in this case is two masks, one for the background and one for the foreground. Let's display the latter.
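As a sketch of this inference step, assuming hypothetical checkpoint and image file names:

    import torch
    from PIL import Image
    from torchvision import transforms

    model = UNet()
    model.load_state_dict(torch.load("unet_weights.pth"))  # hypothetical checkpoint
    model.eval()

    image = Image.open("car.jpg")                  # hypothetical image path
    x = transforms.ToTensor()(image).unsqueeze(0)  # add a batch dimension

    with torch.no_grad():
        pred = model(x)  # shape (1, 2, H, W): background and foreground masks

    foreground_mask = pred[0, 1]  # the mask we display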
11. Let's practice!
It's your turn to build a U-Net!