[Paper Summary] EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation

This post will analyze the research paper “EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation.” We will discuss the problems with existing medical image segmentation methods and how the given method (EMCAD) solves these issues.


What is EMCAD?

EMCAD is a newly developed efficient multi-scale convolutional attention decoder designed to optimize performance and computational efficiency. It features multi-scale depth-wise convolution blocks, channel attention, spatial attention, and large-kernel gated attention mechanisms, which make this method highly effective.

By employing group and depth-wise convolutions, EMCAD is highly efficient, requiring only 1.91 million parameters and 0.381 GFLOPs when paired with a standard encoder.

Through comprehensive evaluations across 12 datasets spanning six different medical image segmentation tasks, EMCAD has demonstrated state-of-the-art (SOTA) performance. Compared to other methods, it reduces parameters by 79.4% and computational cost (FLOPs) by 80.3%.


Problems with Existing Methods

The two main problems with existing methods are:

  • Computationally expensive.
  • Self-attention lacks local spatial context.

Computationally Expensive

Attention mechanisms have been added to improve the accuracy of pixel-level classification in medical images. While they boost performance, these models still rely on heavy convolutional blocks, which are resource-intensive. This makes them slower and harder to use in real-world scenarios, especially with limited computing power.

Self-Attention Lacks Local Spatial Context

Self-attention (SA) captures the big picture by focusing on global information. However, it struggles to capture the finer, more detailed spatial relationships within the image. Some methods try to fix this by using local convolutional attention in the decoders to recover these details, but this comes at the cost of even more computational power.

EMCAD addresses these issues by using an efficient multi-scale depth-wise convolution block. This approach improves feature maps through multi-scale convolutions and efficiently captures global and local spatial information without the high computational demands of older methods.



Contributions

The contributions of the given paper are as follows:

  • New Efficient Multi-scale Convolutional Decoder: Introduces an efficient multi-scale cascaded fully-convolutional attention decoder (EMCAD). It has only 0.506M parameters and 0.11G FLOPs for a tiny encoder with #channels = [32, 64, 160, 256]. It has 1.91M parameters and 0.381G FLOPs for a standard encoder with #channels = [64, 128, 320, 512].
  • Efficient Multi-scale Convolutional Attention Module (MSCAM): MSCAM performs depth-wise convolutions at multiple scales, which makes it efficient.
  • Large-kernel Grouped Attention Gate: A new grouped attention gate to fuse refined features with the features from skip connections. It uses larger-kernel (3×3) group convolutions instead of the point-wise (1×1) convolutions used in earlier attention gates.
  • Improved Performance: The paper empirically shows that EMCAD can be used with any hierarchical vision encoder (e.g., PVTv2-B0, PVTv2-B2) while significantly improving the performance of 2D medical image segmentation. EMCAD produces better results than SOTA methods with a significantly lower computational cost on 12 medical image segmentation benchmarks belonging to six different tasks.


Proposed Method for Medical Image Segmentation

The proposed method uses the following blocks:

  1. Efficient multi-scale convolutional attention decoding (EMCAD)
  2. Large-kernel grouped attention gate (LGAG)
  3. Multi-scale convolutional attention module (MSCAM)
  4. Efficient up-convolution block (EUCB)
  5. Segmentation head (SH)

Efficient multi-scale convolutional attention decoding (EMCAD)

EMCAD processes the multi-stage features extracted from pre-trained hierarchical vision encoders. It consists of the following:

  1. Multi-scale convolutional attention module (MSCAM): These are used to enhance the feature representation robustly.
  2. Large-kernel grouped attention gate (LGAG): LGAGs help refine the feature maps by fusing with the skip connection via a gated attention mechanism.
  3. Efficient up-convolution block (EUCB): These EUCBs help to upsample the feature maps.
  4. Segmentation head (SH): The segmentation heads produce the segmentation maps.

Large-kernel grouped attention gate (LGAG)

LGAG is used to combine feature maps with the attention coefficient progressively. It allows higher activation of relevant features and suppression of irrelevant ones. LGAG utilizes 3 × 3 group convolution instead of the 1 × 1 convolution used in Attention UNet. Due to using 3 × 3 kernel group convolutions, LGAG captures comparatively larger spatial contexts with less computational cost.

LGAG employs a gating signal derived from higher-level features to control the flow of information across different stages of the network, thus enhancing its precision for medical image segmentation.

It begins with two inputs, a gating signal (g) and an input feature map (x). Each is followed by a 3 × 3 group convolution and a batch-normalization. After that, we add both the output feature maps and pass them through the ReLU activation function. Next follows a 1 × 1 convolution and sigmoid activation function, which gives us the attention coefficient. Now, we perform an element-wise multiplication of the input feature map (x) with the attention coefficient to scale the feature accordingly.
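To make this concrete, below is a minimal PyTorch sketch of an LGAG following the steps just described. The channel sizes, the group count, and the assumption that g and x share the same spatial size are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LGAG(nn.Module):
    """Sketch of a large-kernel grouped attention gate; channel sizes and the
    group count are illustrative. `groups` must divide the channel dimensions."""
    def __init__(self, f_g, f_x, f_int, groups=4):
        super().__init__()
        # 3x3 group convolution + batch normalization on the gating signal g
        self.w_g = nn.Sequential(
            nn.Conv2d(f_g, f_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(f_int))
        # 3x3 group convolution + batch normalization on the input feature map x
        self.w_x = nn.Sequential(
            nn.Conv2d(f_x, f_int, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(f_int))
        self.relu = nn.ReLU(inplace=True)
        # 1x1 convolution + sigmoid -> attention coefficient in [0, 1]
        self.psi = nn.Sequential(nn.Conv2d(f_int, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, g, x):
        # g and x are assumed to have the same spatial size at this point
        attn = self.psi(self.relu(self.w_g(g) + self.w_x(x)))
        return x * attn  # scale the skip-connection features by the attention coefficient


# Example: fuse a decoder feature (gate) with an encoder skip connection
gate = torch.randn(1, 128, 28, 28)
skip = torch.randn(1, 128, 28, 28)
print(LGAG(f_g=128, f_x=128, f_int=64)(gate, skip).shape)  # torch.Size([1, 128, 28, 28])
```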


Multi-scale convolutional attention module (MSCAM)

The MSCAM, as its name highlights, provides both an efficient multi-scale extraction and attention to the feature maps. This way, it enhances the feature representation and boosts the segmentation performance.

The MSCAM consists of multiple different sub-modules, which are as follows:

  1. Multi-scale Convolution Block (MSCB)
  2. Channel Attention Block (CAB)
  3. Spatial Attention Block (SAB)

The MSCAM applies a Channel Attention Block first, followed by a Spatial Attention Block, and finally the Multi-Scale Convolution Block. Thanks to its depth-wise convolution layers, the MSCAM is effective while keeping the computational cost significantly lower.

Multi-Scale Convolution Block (MSCB)

The Multi-Scale Convolution Block follows the design of MobileNetV2’s inverted residual block (IRB). It performs a depth-wise convolution at multiple scales and then shuffles the channels.

It begins with a 1×1 convolution layer, doubling the number of output channels, followed by a batch normalization and a ReLU6 activation function. After that, Multi-Scale Depth-wise Convolution (MSDC) is used to capture multi-scale and multi-resolution contexts using three parallel depth-wise convolution layers with different kernel sizes.

It is followed by batch normalization, ReLU activation, and a channel shuffle operation. Because depth-wise convolution ignores the relationships among channels, the channel shuffle operation is used to incorporate these cross-channel relationships.

After that, a 1×1 convolution, batch normalization, and a residual connection are used to deal with the exploding and vanishing gradient problem.
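A minimal PyTorch sketch of an MSCB along these lines is shown below. The expansion factor of 2, the kernel sizes [1, 3, 5], and the use of parallel depth-wise branches follow the description in this post; summing the branch outputs, the shuffle group count, and other details are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    """Standard channel-shuffle operation (as in ShuffleNet): interleaves channels
    across groups so that depth-wise/group convolutions can mix information."""
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class MSCB(nn.Module):
    """Sketch of the multi-scale convolution block following the inverted-residual
    design described above; treat the unspecified details as assumptions."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5), expansion=2, shuffle_groups=2):
        super().__init__()
        hidden = channels * expansion
        self.shuffle_groups = shuffle_groups
        # 1x1 expansion convolution + BN + ReLU6
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True))
        # MSDC: parallel depth-wise convolutions with different kernel sizes
        self.msdc = nn.ModuleList([
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden, bias=False)
            for k in kernel_sizes])
        self.bn_act = nn.Sequential(nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        # 1x1 projection convolution + BN, followed by the residual connection
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.expand(x)
        out = sum(branch(out) for branch in self.msdc)   # parallel multi-scale branches
        out = self.bn_act(out)
        out = channel_shuffle(out, self.shuffle_groups)
        return x + self.project(out)                     # residual connection


print(MSCB(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```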

Channel Attention Block (CAB)

Each channel in a convolutional feature map carries different information and therefore has a different importance. Because channels should not all be treated equally, the Channel Attention Block assigns a different level of importance to each channel.

It begins by applying adaptive maximum pooling and adaptive average pooling over the spatial dimensions (i.e., height and width). We apply a 1×1 convolution to each pooled representation, reducing the number of output channels to r = C/16. Next, a ReLU activation is applied, followed by another 1×1 convolution that restores the number of output channels to that of the input feature map (C). We then add the outputs from both branches and pass them through a sigmoid activation function to get the attention weights. Finally, we apply an element-wise multiplication between the input feature map and the attention weights to get a more refined feature map.
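A compact sketch of this channel attention block (CBAM-style) follows. Whether the two pooled branches share the same 1×1 convolutions is not spelled out above, so the shared-weights choice here is an assumption.

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Sketch of the channel attention block; the reduction ratio of 16 comes
    from the description above (r = C/16)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        reduced = max(channels // reduction, 1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # (B, C, 1, 1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # (B, C, 1, 1)
        # 1x1 conv -> ReLU -> 1x1 conv, applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, reduced, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1, bias=False))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w = self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * w   # re-weight each channel of the input feature map


print(CAB(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```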

Spatial Attention Block (SAB)

The Spatial Attention Block helps the network focus on specific parts of the input feature map. It indicates where to look in the feature map, suppressing irrelevant regions and highlighting the important ones.

It begins by applying average-pooling and max-pooling operations along the channel axis and concatenating their outputs to get a unified representation. Next, it is followed by a 7×7 convolution layer to increase the local receptive field, which enhances the local contextual relationships among features. After that, we apply a sigmoid activation function to calculate the attention weights. Finally, we apply an element-wise multiplication between the input feature map and the attention weights to get a more refined feature map.
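Below is a minimal PyTorch sketch of the SAB as just described (pooling along the channel axis, a 7×7 convolution, and a sigmoid).

```python
import torch
import torch.nn as nn

class SAB(nn.Module):
    """Sketch of the spatial attention block: pool along the channel axis,
    apply a 7x7 convolution, and compute sigmoid attention weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)     # (B, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # (B, 1, H, W)
        w = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * w   # highlight informative spatial locations


print(SAB()(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

With all three sub-modules sketched, the MSCAM itself can be read as a simple chain in the order described earlier: channel attention, then spatial attention, then the multi-scale convolution block. The outline below reuses the CAB, SAB, and MSCB classes from the snippets above; it is a structural sketch, not the paper's exact module.

```python
import torch.nn as nn

class MSCAM(nn.Module):
    """Chains the CAB, SAB and MSCB sketches above; a structural outline only."""
    def __init__(self, channels):
        super().__init__()
        self.cab = CAB(channels)    # channel attention block
        self.sab = SAB()            # spatial attention block
        self.mscb = MSCB(channels)  # multi-scale convolution block

    def forward(self, x):
        return self.mscb(self.sab(self.cab(x)))
```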


Efficient up-convolution block (EUCB)

It progressively upsamples the feature maps of the current stage to match the spatial resolution of the next skip connection.

It begins with upsampling by a factor of two, followed by a 3×3 depth-wise convolution, batch normalization, and a ReLU activation function. Finally, a 1×1 convolution sets the number of output channels to match the next stage.
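A small PyTorch sketch along these lines is shown below; the upsampling mode (bilinear here) and channel numbers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EUCB(nn.Module):
    """Sketch of the efficient up-convolution block described above."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # 3x3 depth-wise convolution + BN + ReLU on the upsampled feature map
        self.dwc = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True))
        # 1x1 convolution to match the channel count of the next (shallower) stage
        self.pwc = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pwc(self.dwc(self.up(x)))


# e.g. upsample a stage-4 feature (512 channels, 7x7) to stage-3 size (320 channels, 14x14)
print(EUCB(512, 320)(torch.randn(1, 512, 7, 7)).shape)  # torch.Size([1, 320, 14, 14])
```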

Segmentation head (SH)

It produces the segmentation output by applying a 1 × 1 convolution, with the number of output channels equal to the number of classes.
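In code, the segmentation head amounts to a single layer; the channel count and number of classes below are illustrative values:

```python
import torch.nn as nn

# Segmentation head: a 1x1 convolution mapping decoder channels to per-class logits.
seg_head = nn.Conv2d(in_channels=64, out_channels=9, kernel_size=1)
```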


Overall Architecture

The proposed EMCAD decoder is paired with the tiny (PVTv2-B0) and standard (PVTv2-B2) PVTv2 networks. As the SOTA results show, the decoder adapts seamlessly to pre-trained encoders.

Two architectures, PVT-EMCAD-B0 and PVT-EMCAD-B2, were developed using PVTv2-B0 (Tiny) and PVTv2-B2 (Standard) encoders.

Both architectures have four segmentation heads, which produce four prediction maps: p1, p2, p3, and p4. Out of these, p4 is considered the final segmentation map, which is then passed through either a sigmoid or softmax activation function, depending on the segmentation task.
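To show how these pieces could fit together, here is a rough wiring sketch that chains the LGAG, MSCAM, and EUCB classes from the snippets above with per-stage 1×1 segmentation heads. The fusion by addition, the intermediate channel choices, and the p1-to-p4 ordering are illustrative assumptions; the paper's architecture figure is the authoritative reference.

```python
import torch
import torch.nn as nn

class EMCADDecoderSketch(nn.Module):
    """Illustrative wiring of the blocks sketched above (MSCAM, LGAG, EUCB);
    the fusion by addition and the p1..p4 ordering are assumptions."""
    def __init__(self, channels=(64, 128, 320, 512), num_classes=9):
        super().__init__()
        c1, c2, c3, c4 = channels
        self.mscam = nn.ModuleList([MSCAM(c) for c in (c4, c3, c2, c1)])
        self.eucb = nn.ModuleList([EUCB(c4, c3), EUCB(c3, c2), EUCB(c2, c1)])
        self.lgag = nn.ModuleList([LGAG(c3, c3, c3 // 2), LGAG(c2, c2, c2 // 2),
                                   LGAG(c1, c1, c1 // 2)])
        self.heads = nn.ModuleList([nn.Conv2d(c, num_classes, 1) for c in (c4, c3, c2, c1)])

    def forward(self, x1, x2, x3, x4):        # x1: shallowest ... x4: deepest encoder stage
        skips = [x3, x2, x1]
        d = self.mscam[0](x4)
        preds = [self.heads[0](d)]            # prediction from the deepest stage
        for i, skip in enumerate(skips):
            u = self.eucb[i](d)               # upsample to the next stage's resolution
            d = self.mscam[i + 1](u + self.lgag[i](g=u, x=skip))
            preds.append(self.heads[i + 1](d))
        return preds                          # [p1, p2, p3, p4]; p4 is the final map


# Example with PVTv2-B2-like feature shapes (input image 224x224):
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 320, 512), (56, 28, 14, 7))]
p1, p2, p3, p4 = EMCADDecoderSketch()(*feats)
print(p4.shape)  # torch.Size([1, 9, 56, 56]); resized to the input size in practice
```

As described above, the final map (p4) would then be passed through a sigmoid (binary tasks) or a softmax (multi-class tasks).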


Multi-Stage Loss

The loss involves calculating the loss for all possible combinations of predictions derived from the 4 heads, totaling 2⁴ − 1 = 15 unique combined predictions, and then summing these losses. We focus on minimizing this cumulative combinatorial loss during training.

For binary segmentation, we optimize an additive loss over all the individual predictions, each with its own weight, plus an additional term in which all four predictions are summed before the loss is calculated.

Multi-stage loss function: L = α·ℓ(p1, y) + β·ℓ(p2, y) + γ·ℓ(p3, y) + ζ·ℓ(p4, y) + δ·ℓ(p1 + p2 + p3 + p4, y), where ℓ(·, y) is the segmentation loss computed against the ground truth y.

Here, α = β = γ = ζ = δ = 1.0 are the weights assigned to each loss.
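A hedged sketch of both loss variants is given below. The per-prediction criterion (e.g. DICE plus cross-entropy) is left abstract, and aggregating a combination of heads by summing their prediction maps before scoring is an interpretation of the description above, not a quote from the paper.

```python
from itertools import combinations
import torch
import torch.nn as nn


def combinatorial_loss(preds, target, criterion):
    """Loss summed over every non-empty combination of the 4 prediction heads
    (2^4 - 1 = 15 combinations). Each combination is aggregated here by adding
    the prediction maps before scoring them (an assumption)."""
    total = 0.0
    for r in range(1, len(preds) + 1):
        for combo in combinations(preds, r):
            total = total + criterion(sum(combo), target)
    return total


def additive_loss(p1, p2, p3, p4, target, criterion,
                  alpha=1.0, beta=1.0, gamma=1.0, zeta=1.0, delta=1.0):
    """Additive multi-stage loss for binary segmentation: one weighted term per
    head plus one term on the sum of all four predictions."""
    return (alpha * criterion(p1, target) + beta * criterion(p2, target)
            + gamma * criterion(p3, target) + zeta * criterion(p4, target)
            + delta * criterion(p1 + p2 + p3 + p4, target))


# Example with a placeholder per-prediction criterion (BCE on logits)
preds = [torch.randn(2, 1, 224, 224) for _ in range(4)]
target = torch.randint(0, 2, (2, 1, 224, 224)).float()
criterion = nn.BCEWithLogitsLoss()
print(combinatorial_loss(preds, target, criterion))
print(additive_loss(*preds, target, criterion))
```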


Datasets

The proposed EMCAD is trained and evaluated on the following medical image segmentation datasets:

  1. Polyp Segmentation
    • CVC-ClinicDB
    • CVC-ColonDB
    • ETIS-Larib
    • Kvasir-SEG
    • BKAI-IGH
  2. Skin Lesion Segmentation
    • ISIC 2017
    • ISIC 2018
  3. Cell Segmentation
    • 2018 Data Science Bowl
    • EM
  4. Breast Cancer Segmentation
    • BUSI
  5. Abdomen Organ Segmentation
    • Synapse Multi-organ dataset
  6. Cardiac Organ Segmentation
    • ACDC dataset


Results

The proposed models (PVT-EMCAD-B0 and PVT-EMCAD-B2) are compared against CNN- and transformer-based segmentation methods on 12 datasets from six medical image segmentation tasks. The results show that EMCAD achieves state-of-the-art (SOTA) performance.

Quantitative results on Polyp Segmentation, Skin Lesion Segmentation, Cell Segmentation and Breast Cancer Segmentation:

Qualitative results on the Polyp Segmentation:

Quantitative results on Synapse Multi-organ dataset:

Qualitative results on the Synapse Multi-organ dataset:

Quantitative results of cardiac organ segmentation on ACDC dataset:


Ablation Study

The ablation study is conducted to test the different components of the proposed EMCAD:

  1. Effect of different components of EMCAD
  2. Effect of multi-scale kernels in MSCAM
  3. Comparison with the baseline decoder
  4. Parallel vs. sequential depth-wise convolution
  5. Effectiveness of our large-kernel grouped attention gate (LGAG) over attention gate (AG)
  6. Effect of transfer learning from ImageNet pre-trained weights
  7. Effect of deep supervision
  8. Effect of Input Resolutions

Effect of different components of EMCAD

Experiments show that adding modules like Cascaded Structure, LGAG, and MSCAM to the decoder improves performance, with MSCAM being the most effective. Using LGAG and MSCAM together gives the highest DICE score of 83.63%, with a 3.53% improvement, although it increases FLOPs and parameters.

Effect of multi-scale kernels in MSCAM

Tests on different kernel sizes for depth-wise convolutions reveal that combining 1×1, 3×3, and 5×5 kernels enhances performance, achieving the best results on two datasets. Larger kernels (7×7, 9×9) reduce performance, so [1, 3, 5] kernels are chosen for the experiments.

Comparison with the baseline decoder

The EMCAD decoder outperforms the CASCADE baseline with significantly fewer parameters and FLOPs. EMCAD with PVTv2-B2 and PVTv2-B0 achieves better DICE scores while reducing computational complexity by over 70%.

Parallel vs. sequential depth-wise convolution

Experiments show that parallel depth-wise convolutions offer slightly better performance (0.03% to 0.15%) and more consistent results with lower standard deviations than sequential ones. Hence, parallel convolution is chosen for all experiments.

Effectiveness of our large-kernel grouped attention gate (LGAG) over attention gate (AG)

LGAG significantly improves DICE scores while reducing parameters (up to 91.17%) and FLOPs (up to 83.03%) compared to AG. The performance gains are more pronounced in larger models, demonstrating LGAG’s scalability.

Effect of transfer learning from ImageNet pre-trained weights

Transfer learning from ImageNet significantly improves DICE, mIoU, and HD95 scores, with a greater impact on smaller models like PVT-EMCAD-B0. Performance improvements are observed across most organs except the Gallbladder.

Effect of transfer learning from ImageNet pre-trained weights on the Synapse multi-organ dataset (all results are averaged over five runs):

Effect of deep supervision

Deep Supervision (DS) slightly improves DICE scores in six out of seven datasets, with the largest impact on the Synapse multi-organ dataset, confirming its benefit in improving segmentation performance.

Effect of Input Resolutions

Higher input resolutions improve DICE scores but increase FLOPs. PVT-EMCAD-B0 is more efficient at higher resolutions, while PVT-EMCAD-B2 achieves the best DICE score with more computational cost, indicating B0’s suitability for larger inputs.


Conclusion

In this paper, the authors introduce EMCAD, a novel multi-scale convolutional attention decoder tailored for enhancing medical image segmentation. EMCAD utilizes a multi-scale depth-wise convolution block to effectively aggregate and refine features across different scales within feature maps. This approach enhances precision in medical image segmentation by capturing diverse scale information efficiently compared to traditional 3 × 3 convolution blocks.

Experimental results demonstrate that EMCAD outperforms the CASCADE decoder regarding DICE scores while achieving significant reductions in model complexity. Specifically, EMCAD requires 79.4% fewer parameters and 80.3% fewer computational operations (FLOPs) than CASCADE. Extensive evaluations across 12 public datasets covering various 2D medical image segmentation tasks consistently show EMCAD’s superior performance compared to state-of-the-art methods.

Moreover, EMCAD’s compatibility with smaller encoders makes it suitable for point-of-care applications without compromising performance. The authors anticipate that EMCAD will significantly advance medical image and semantic segmentation tasks.
