U2-Net is a simple yet powerful architecture designed for salient object detection (SOD). It is a two-level nested U-shaped architecture built from the proposed ReSidual U-blocks (RSU). U2-Net does not use any pre-trained backbone and is trained from scratch. The architecture comes in two variants, U2-Net and U2-Net lite, which makes it usable in different environments.
What is Salient Object Detection (SOD)?
Salient Object Detection is the task of segmenting the most visually prominent object in an input image by producing a high-quality mask that preserves the object's fine details.
Salient Object Detection methods are inspired by the human visual system, which is capable of picking out the most salient object in a scene while ignoring the rest.
Issues with Existing Salient Object Detection Methods
Most existing SOD methods use backbones pre-trained on the ImageNet classification dataset. Backbones such as VGG, ResNet, and ResNeXt are designed for image classification, so they lack the local details and global contrast information that are essential for salient object detection.
These backbones lose fine local details from the high-resolution features during the downsampling in their initial layers, and they concentrate their capacity in the deeper layers, which increases their computational cost.
In most SOD methods, feature aggregation modules are therefore added on top of the backbone to extract multi-scale features. This makes the network more complicated and further increases its computational cost.
Main Contributions of the U2-Net
U2-Net addresses these problems by maintaining high-resolution features at low memory and computation cost.
U2-Net is a two-level nested architecture that uses no pre-trained backbone, yet achieves competitive performance while training from scratch. It is built from the novel ReSidual U-blocks (RSU), which extract intra-stage multi-scale features without degrading the feature map resolution. The overall network follows a U-Net-like encoder-decoder structure in which each stage is itself an RSU block.
The method achieves state-of-the-art performance on six datasets and runs in real time at 30 FPS on a 320 × 320 × 3 input on a 1080Ti GPU. U2-Net lite achieves competitive results against most of the SOTA models at 40 FPS.
ReSidual U-block (RSU)
The ReSidual U-block (RSU) is designed to capture both local details and global contrast information, which are essential for salient object detection. It is inspired by the U-Net symmetric encoder-decoder structure and the residual block, and it captures intra-stage multi-scale features effectively.
The RSU is, in itself, a small U-Net with a residual connection from the input feature map to the output feature map.
The ReSidual U-block is denoted RSU-L(C_in, M, C_out), where:
- L – the number of layers in the encoder part of the RSU.
- C_in – the number of input channels.
- M – the number of channels in the internal layers.
- C_out – the number of output channels.
The RSU has the following components:
- An input convolution layer that transforms the input feature map x (H × W × C_in) into an intermediate feature map F1(x) (H × W × C_out).
- A U-Net-like symmetric encoder-decoder structure of height L, with a dilated convolution layer connecting the encoder and the decoder. This structure lets the block extract both local and global features effectively.
- A residual connection that fuses the encoder input F1(x) with the decoder output. This fusion combines the local details from the input convolution with the multi-scale global information from the encoder-decoder, producing a much more effective feature representation.
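The components above can be sketched in PyTorch as follows. This is a minimal sketch, not the official implementation: the class and helper names (`RSU`, `conv_bn_relu`) are hypothetical, and details such as padding, dilation placement, and channel counts are simplified from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(c_in, c_out, dilation=1):
    # 3x3 convolution -> batch norm -> ReLU: the basic unit of the RSU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class RSU(nn.Module):
    """Sketch of RSU-L(C_in, M, C_out): a small U-Net with a residual connection."""

    def __init__(self, L, c_in, m, c_out):
        super().__init__()
        self.conv_in = conv_bn_relu(c_in, c_out)  # produces F1(x)
        # Encoder: L - 1 layers (the first maps C_out -> M, the rest M -> M).
        self.enc = nn.ModuleList(
            [conv_bn_relu(c_out, m)] + [conv_bn_relu(m, m) for _ in range(L - 2)]
        )
        self.bottom = conv_bn_relu(m, m, dilation=2)  # dilated layer connecting encoder and decoder
        self.pool = nn.MaxPool2d(2, stride=2, ceil_mode=True)
        # Decoder: each layer fuses the upsampled map with an encoder skip (2*M channels in).
        self.dec = nn.ModuleList(
            [conv_bn_relu(2 * m, m) for _ in range(L - 2)] + [conv_bn_relu(2 * m, c_out)]
        )

    def forward(self, x):
        fx = self.conv_in(x)  # F1(x): H x W x C_out
        h, skips = fx, []
        for i, enc in enumerate(self.enc):
            h = enc(h)
            skips.append(h)
            if i < len(self.enc) - 1:  # no pooling after the deepest encoder layer
                h = self.pool(h)
        h = self.bottom(h)
        for dec in self.dec:
            skip = skips.pop()  # deepest skip first
            if h.shape[-2:] != skip.shape[-2:]:
                h = F.interpolate(h, size=skip.shape[-2:], mode="bilinear",
                                  align_corners=False)
            h = dec(torch.cat([h, skip], dim=1))
        return fx + h  # residual connection: local details + multi-scale context
```

For example, `RSU(7, 3, 12, 64)` would correspond to an RSU-7(3, 12, 64) block: a 3-channel input, 12 internal channels, and a 64-channel output at the same spatial resolution as the input.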
Architecture of U2-Net
U2-Net is a nested U-shaped architecture proposed for salient object detection. It is an encoder-decoder structure consisting of six encoder blocks, five decoder blocks, and a saliency fusion module attached to the decoder blocks to generate the final saliency map.
Among the encoder blocks, En_1, En_2, En_3, and En_4 are built using the Residual U-blocks RSU-7, RSU-6, RSU-5, and RSU-4 respectively, where 7, 6, 5, and 4 are the number of encoder layers in each RSU block. The number of layers is chosen according to the height and width of the incoming feature map: larger feature maps get deeper RSU blocks so that more large-scale information can be captured.
En_5 and En_6 use a variant of the RSU block, RSU-4F, in which no downsampling or upsampling is applied: their feature maps are already of low resolution, so further resizing would lose information. Instead, dilated convolutions are used in RSU-4F to enlarge the receptive field without changing the resolution.
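The dilation-only variant can be sketched like this (again a hypothetical, simplified sketch; the paper's RSU-4F uses dilation rates 1, 2, 4, and 8, which is what is mirrored here):

```python
import torch
import torch.nn as nn


def conv_bn_relu(c_in, c_out, dilation=1):
    # 3x3 convolution -> batch norm -> ReLU, with padding matched to the dilation
    # so the spatial resolution never changes.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class RSU4F(nn.Module):
    """Sketch of RSU-4F: no pooling or upsampling anywhere; the receptive
    field grows through increasing dilation rates instead."""

    def __init__(self, c_in, m, c_out):
        super().__init__()
        self.conv_in = conv_bn_relu(c_in, c_out)
        self.enc1 = conv_bn_relu(c_out, m, dilation=1)
        self.enc2 = conv_bn_relu(m, m, dilation=2)
        self.enc3 = conv_bn_relu(m, m, dilation=4)
        self.bottom = conv_bn_relu(m, m, dilation=8)
        self.dec3 = conv_bn_relu(2 * m, m, dilation=4)
        self.dec2 = conv_bn_relu(2 * m, m, dilation=2)
        self.dec1 = conv_bn_relu(2 * m, c_out, dilation=1)

    def forward(self, x):
        fx = self.conv_in(x)
        e1 = self.enc1(fx)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        h = self.bottom(e3)
        # Decoder fuses skips by concatenation, all at the same resolution.
        h = self.dec3(torch.cat([h, e3], dim=1))
        h = self.dec2(torch.cat([h, e2], dim=1))
        h = self.dec1(torch.cat([h, e1], dim=1))
        return fx + h  # residual connection, as in the standard RSU
```

Because every layer keeps the input resolution, RSU-4F is cheap to apply to the small feature maps in En_5, En_6, and De_5 while still aggregating multi-scale context.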
There are five decoder blocks: De_5, De_4, De_3, De_2, and De_1. De_5 uses the same RSU-4F block as En_5 and En_6. In each of the remaining decoder blocks, we concatenate the incoming feature map with the feature map from the symmetric encoder stage and pass the result through an RSU block; the output is then upsampled by a factor of 2 using bilinear interpolation.
In the saliency fusion module, we take the output of each decoder block and of the last encoder block (En_6), pass it through a 3×3 convolution to produce a one-channel map, and upsample it to the input resolution with bilinear interpolation. This yields six side saliency probability maps, which are concatenated and passed through a 1×1 convolution (and a sigmoid) to generate the final saliency map.
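The fusion step can be sketched as a small head module (a hypothetical sketch: the class name `SaliencyFusion` and its interface are assumptions, and the official implementation also supervises each side map during training):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyFusion(nn.Module):
    """Sketch of the side-output + fusion head: one 3x3 conv per feature map,
    bilinear upsampling to the input size, then a 1x1 conv over the
    concatenated side maps."""

    def __init__(self, side_channels):
        super().__init__()
        # One 3x3 convolution per side output, each producing a 1-channel map.
        self.side = nn.ModuleList(
            [nn.Conv2d(c, 1, 3, padding=1) for c in side_channels]
        )
        # 1x1 convolution that fuses the concatenated side maps into one map.
        self.fuse = nn.Conv2d(len(side_channels), 1, 1)

    def forward(self, feats, out_size):
        # feats: feature maps from the decoder blocks and En_6 (any resolutions).
        maps = [
            F.interpolate(side(f), size=out_size, mode="bilinear",
                          align_corners=False)
            for side, f in zip(self.side, feats)
        ]
        fused = self.fuse(torch.cat(maps, dim=1))
        # Sigmoid turns logits into saliency probability maps.
        return torch.sigmoid(fused), [torch.sigmoid(m) for m in maps]
```

During training, U2-Net applies a supervision loss to every side map as well as to the fused map; at inference only the fused map is typically used.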
In short, U2-Net is an effective architecture for salient object detection, built as a two-level nested U-shaped structure of residual U-blocks. Available in two variants, it does not rely on pre-trained models and is trained from scratch, making it a versatile solution for a range of environments.