PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model

PP-LiteSeg is a lightweight encoder-decoder architecture designed for real-time semantic segmentation.

The block diagram of the PP-LiteSeg architecture.

It consists of three modules:

  1. Encoder: Lightweight network
  2. Aggregation: Simple Pyramid Pooling Module (SPPM)
  3. Decoder: Flexible and Lightweight Decoder (FLD) and Unified Attention Fusion Module (UAFM)

Encoder

STDCNet is used as the encoder of the proposed PP-LiteSeg because of its strong balance of accuracy and speed.

The research paper presents two versions of PP-LiteSeg: PP-LiteSeg-T and PP-LiteSeg-B, which use STDC1 and STDC2 as their encoders, respectively.

Model        | Encoder | Channels in Decoder
PP-LiteSeg-T | STDC1   | 32, 64, 128
PP-LiteSeg-B | STDC2   | 64, 96, 128

The details of the proposed PP-LiteSeg.

Flexible and Lightweight Decoder (FLD)

  • FLD gradually decreases the channels of the features from high level to low level.
  • FLD can easily adjust the computation cost to achieve a better balance between the encoder and the decoder.
  • The FLD consists of a set of convolutional layers along with the Unified Attention Fusion Module (UAFM) to enrich the features by using channel and spatial attention.

Unified Attention Fusion Module (UAFM)

Unified Attention Fusion Module
  • UAFM applies channel and spatial attention to enrich the fused feature representations.
  • The UAFM takes two input feature maps: F_high, from the previous block, and F_low, from the corresponding encoder block.
  • F_high is first upsampled with bilinear interpolation so that its spatial size matches F_low. Both feature maps are then passed through the attention module, which produces the weight α.
  • Finally, F_high and F_low are multiplied element-wise by α and 1−α respectively, and the two results are added element-wise to produce the fused output.
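The fusion steps above can be sketched in NumPy. This is only an illustrative approximation: nearest-neighbour upsampling stands in for the paper's bilinear interpolation, and an unweighted mean of mean/max cues stands in for the learned attention convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uafm_fuse(f_high, f_low):
    """Fuse a high-level (C, H/2, W/2) and a low-level (C, H, W) feature map.

    Nearest-neighbour 2x upsampling replaces bilinear interpolation, and the
    alpha weight is a hand-rolled stand-in for the learned attention module.
    """
    up = f_high.repeat(2, axis=1).repeat(2, axis=2)            # (C, H, W)
    # Spatial cues from both inputs: per-pixel mean and max over channels.
    cues = np.stack([up.mean(0), up.max(0), f_low.mean(0), f_low.max(0)])
    alpha = sigmoid(cues.mean(axis=0))                         # (H, W) in (0, 1)
    # Convex combination: alpha weights the high-level feature, 1 - alpha the low-level one.
    return up * alpha + f_low * (1.0 - alpha)
```

Because the output is a convex combination, every fused value lies between the corresponding upsampled high-level value and the low-level value.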

Spatial Attention Module (SAM)

Spatial Attention Module

The spatial attention module exploits the inter-spatial relationships in the input feature map and produces a spatial attention map. This map highlights the regions of interest in the input feature map and suppresses the irrelevant ones.
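A minimal NumPy sketch of this idea: reduce along the channel axis with mean and max, fuse the two maps, and squash to (0, 1). An unweighted average stands in for the convolution the paper uses to fuse the cues.

```python
import numpy as np

def spatial_attention_map(feat):
    """Sketch of SAM for a (C, H, W) feature map.

    The paper fuses the mean/max maps with a learned convolution; a plain
    average is used here purely for illustration.
    """
    mean_map = feat.mean(axis=0)          # (H, W) inter-spatial mean cue
    max_map = feat.max(axis=0)            # (H, W) inter-spatial max cue
    fused = 0.5 * (mean_map + max_map)
    return 1.0 / (1.0 + np.exp(-fused))   # sigmoid -> attention map in (0, 1)
```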

Channel Attention Module (CAM)

Channel Attention Module

The channel attention module exploits the inter-channel relationship from the input feature map and produces a vector representing the importance of each channel present in the input feature map.
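The same pattern applied per channel can be sketched as follows; mean- and max-pooling over the spatial dimensions are averaged as a stand-in for the learned fusion used in the paper.

```python
import numpy as np

def channel_attention_vector(feat):
    """Sketch of CAM for a (C, H, W) feature map.

    Returns a (C,) vector of per-channel importance weights in (0, 1).
    """
    avg_pool = feat.mean(axis=(1, 2))     # (C,) global average pooling
    max_pool = feat.max(axis=(1, 2))      # (C,) global max pooling
    fused = 0.5 * (avg_pool + max_pool)   # stand-in for the learned fusion
    return 1.0 / (1.0 + np.exp(-fused))   # sigmoid -> per-channel weight
```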

Simple Pyramid Pooling Module (SPPM)

The block diagram of the Simple Pyramid Pooling Module.
  • SPPM acts as a bridge that connects the encoder and the decoder of PP-LiteSeg.
  • It begins with global average pooling at three different bin sizes, i.e., 1×1, 2×2, and 4×4.
  • Each of the three pooled features is then passed through a 1×1 convolution, batch normalization, and a ReLU activation function.
  • Next, all three features are upsampled and added together to produce a single fused feature map.
  • Finally, the fused feature map is passed through another 1×1 convolution, batch normalization, and a ReLU activation function.
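The pool–upsample–add skeleton of SPPM can be sketched in NumPy. The 1×1 conv + BN + ReLU stages are omitted here (they require learned weights), leaving only the multi-scale pooling and fusion, and nearest-neighbour upsampling stands in for interpolation.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) map down to (C, bins, bins); H, W must be divisible by bins."""
    c, h, w = feat.shape
    return feat.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))

def sppm(feat, bin_sizes=(1, 2, 4)):
    """Sketch of SPPM: pool at each bin size, upsample back, and fuse by averaging."""
    c, h, w = feat.shape
    fused = np.zeros_like(feat)
    for b in bin_sizes:
        pooled = adaptive_avg_pool(feat, b)                        # (C, b, b)
        up = pooled.repeat(h // b, axis=1).repeat(w // b, axis=2)  # back to (C, H, W)
        fused += up
    return fused / len(bin_sizes)
```

A constant input passes through unchanged, since pooling and upsampling a constant map reproduce it at every scale.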

Datasets and Implementation Details

  • The model is trained on two different datasets: Cityscapes and CamVid.
  • Stochastic Gradient Descent (SGD) is used as an optimizer with a momentum of 0.9.
  • The “poly” learning rate scheduler is used while training the PP-LiteSeg.
  • For data augmentation, random scaling, random cropping, random horizontal flipping, random color jittering, and normalization are used. The random scale range is [0.125, 1.5] for Cityscapes and [0.5, 2.5] for CamVid.
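The "poly" schedule mentioned above decays the learning rate from its base value to zero over training. A minimal sketch, assuming the commonly used power of 0.9 (the paper's exact value is not stated here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate schedule: base_lr * (1 - iter/max_iter) ** power.

    power=0.9 is the usual choice for this scheduler; treat it as an assumption.
    """
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

For example, with base_lr = 0.01 the rate halfway through training is 0.01 × 0.5^0.9, a bit more than half the base rate, and it reaches exactly zero at the final iteration.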

Results

The comparisons with state-of-the-art real-time methods on Cityscapes.
The comparisons with state-of-the-art real-time methods on CamVid test set. The input resolution of all methods is 960 × 720.

Summary

In this paper, the authors focused on designing a semantic segmentation architecture for real-time performance. To achieve this, they developed novel modules, the Flexible and Lightweight Decoder (FLD) and the Unified Attention Fusion Module (UAFM), to strengthen the feature representations. The Simple Pyramid Pooling Module (SPPM) is further used to aggregate global contextual information at low computational cost. By combining these novel modules, the authors developed PP-LiteSeg, which achieves top accuracy along with high FPS on high-resolution input.

Nikhil Tomar

I am an independent researcher in the field of Artificial Intelligence. I love to write about the technology I am working on.
