Corresponding author: minghongxie@163.com

MIPANet: Optimizing RGB-D Semantic Segmentation through Multi-modal Interaction and Pooling Attention

Shuai Zhang¹, Minghong Xie¹ (corresponding author)

¹ Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500 Yunnan, PR China

Abstract

Semantic segmentation of RGB-D images involves understanding the appearance and spatial relationships of objects within a scene, which requires careful consideration of various factors. However, in indoor environments, the simple input of RGB and depth images often results in a relatively limited acquisition of semantic and spatial information, leading to suboptimal segmentation outcomes. To address this, we propose the Multi-modal Interaction and Pooling Attention Network (MIPANet), designed to harness the interactive synergy between RGB and depth modalities, optimizing the utilization of complementary information. Specifically, we incorporate a Multi-modal Interaction Module (MIM) into the deepest layers of the network. This module is engineered to facilitate the fusion of RGB and depth information, allowing for mutual enhancement and correction. Additionally, we introduce a Pooling Attention Module (PAM) at various stages of the encoder to enhance the features extracted by the network. The outputs of the PAMs are selectively integrated into the decoder to improve semantic segmentation performance. Experimental results demonstrate that MIPANet outperforms existing methods on two indoor scene datasets, NYUDv2 and SUN-RGBD, by optimizing the insufficient information interaction between different modalities in RGB-D semantic segmentation.

Keywords: RGB-D Semantic Segmentation; Attention Mechanism; Multi-modal Interaction

1 Introduction

In recent years, Convolutional Neural Networks (CNNs) have been widely used in image semantic segmentation, and high-performance CNN-based models have gradually replaced traditional semantic segmentation methods. Since the introduction of Fully Convolutional Neural Networks (FCN) [1, 2], which showed great potential in semantic segmentation tasks, many researchers have proposed improved semantic segmentation models built on this design. Nevertheless, semantic segmentation remains a formidable challenge in some indoor environments, given intricacies such as variations in illumination and mutual occlusion between objects.

With the widespread application of depth sensors and depth cameras [3], image research is no longer limited to RGB color images and now extends to RGB-Depth (RGB-D) images that contain depth information. RGB images provide appearance information such as the color and texture of objects, whereas depth images provide three-dimensional geometric information that is missing from RGB images and is valuable for indoor scenes. References [4, 5] simply concatenate RGB and depth channels to form a four-channel input, improving the accuracy of semantic segmentation. Reference [6] converts depth images into three distinct channels (horizontal disparity, height above ground, and angle of surface normals) to obtain the HHA image, feeds the RGB and HHA features into two parallel CNNs that each predict a semantic segmentation probability map, and fuses the two maps in the last layer of the network as the final segmentation result. Although the above methods have achieved good results in RGB-D semantic segmentation, most RGB-D semantic segmentation methods [7, 8, 9, 10] simply merge RGB features and depth features by concatenation or summation. As a result, the differences between the two modalities are not handled effectively, so the CNN cannot fully use their complementary information, which leads to confusion between objects and the background. For example, the printer and trash bin in Fig. 1 (a) are prone to being wrongly merged into the background.

Figure 1: Improving segmentation accuracy by leveraging depth features within our MIPANet. The prediction result accurately distinguishes the trash can and printer from the background.

To solve the above problems, we propose MIPANet, an RGB-D semantic segmentation network for indoor scenes. Fig. 2 illustrates the overall structure of the network. The network follows an encoder-decoder architecture and integrates two feature fusion modules: the Multi-modal Interaction Module (MIM) and the Pooling Attention Module (PAM). The encoder is composed of two identical CNN branches, one extracting RGB features and the other extracting depth features. RGB and depth features are extracted and fused incrementally across the network levels, and the segmentation results are optimized by exploiting the spatial disparities and semantic interdependencies among the multimodal features. In the PAM, we use adaptive average pooling instead of global average pooling, which not only allows flexible adjustment of the output size but also preserves more spatial information, facilitating the extraction of depth features. In the MIM, we obtain two sets of Q, K, V for the two modalities and perform calculations using the Q, K from one set and V from the other. This achieves information interaction between the RGB and depth modalities. The main contributions of this paper can be summarized as follows:

$\bullet$ We introduce an end-to-end multi-modal fusion network, MIPANet, incorporating multi-modal interaction and pooling attention. This approach optimizes the integration of complementary information from RGB and depth features, effectively tackling the challenge posed by insufficient cross-modal feature fusion in RGB-D semantic segmentation.

$\bullet$ We present two cross-modal feature fusion methods. Within the MIM, a cross-modal feature interaction and fusion mechanism is developed, in which RGB and depth features are collaboratively optimized using attention masks to extract partially detailed features. In addition, the PAM integrates intermediate-layer features into the decoder, enhancing feature extraction and supporting the decoder in upsampling and recovery.

$\bullet$ Experimental results confirm the effectiveness of our proposed RGB-D semantic segmentation network in accurately handling indoor images in complex scenarios. The model demonstrated superior semantic segmentation performance compared to other methods on the publicly available NYUv2 and SUN RGB-D datasets.

2 Related Work

In this section, we provide a comprehensive review of three parts: (1) RGB-D Semantic Segmentation, (2) Attention Mechanism, and (3) Cross-modal Interaction.

2.1 RGB-D Semantic Segmentation

With the widespread application of depth sensors and depth cameras in the field of depth estimation [11, 9, 12, 13], the depth information of a scene can be obtained more conveniently, and image research is no longer limited to a single RGB image. The RGB-D semantic segmentation task aims to efficiently integrate RGB features and depth features to improve segmentation accuracy, especially in indoor scenes. Couprie et al. [4] proposed an early fusion approach, which simply concatenates an image’s RGB and depth channels as a four-channel input to the convolutional neural network. Wang et al. [6] separately input RGB features and HHA features into two CNNs for prediction and perform fusion in the final stage of the network. Reference [14] introduced an encoder-decoder network that employs a dual-branch encoder to extract features separately from RGB images and depth images. The studies mentioned above employed equal-weight concatenation or summation operations to fuse RGB and depth features without fully leveraging the complementary information between different modalities. In recent years, some research has proposed more effective strategies for RGB-D feature fusion. Hu et al. [15] utilised a three-branch encoder that includes RGB, Depth, and Fusion branches, efficiently collecting features without breaking the original RGB and depth inference branches. Seichter et al. [16] presented an efficient RGB-D segmentation approach, characterised by two enhanced ResNet-based encoders utilising an attention-based fusion for incorporating depth information. However, these methods did not fully exploit the differential information between the two modalities and the intermediate-level features extracted by the convolutional network.

2.2 Attention Mechanism

In recent years, attention [17, 18, 19, 20, 21, 22] has been widely used in computer vision and other fields. Vaswani et al. [17] proposed the self-attention mechanism, which has had a profound impact on the design of deep learning models. Fu et al. [19] proposed DANet, which can adaptively integrate local features and their global dependencies. Wang et al. [23] utilised spatial attention in an image classification model. Through the backpropagation of a convolutional neural network, they adaptively learned spatial attention masks, allowing the model to focus on the significant regions of the image. SENet [24] proposed channel attention, which adaptively learns the importance of each feature channel through a neural network. Woo et al. [22] incorporated two attention modules that concurrently capture channel-wise and spatial relationships. ECA-Net [25] introduced a straightforward and efficient “local” channel attention mechanism to minimize computational overhead. MFC [26] introduced a multi-frequency domain attention module to capture information across different frequency domains. Similarly, CAMNet [2] proposed a contrastive attention module designed to amplify local saliency. Building upon this foundation, Huang et al. [27] proposed a cross-attention module that consolidates contextual information both horizontally and vertically and can therefore gather contextual information from all pixels. These methods have demonstrated significant potential in single-modality feature extraction. To effectively leverage the complementary information between different modalities, this paper introduces a Pooling Attention Module that learns the differential information between two distinct modalities and fully exploits the intermediate-level features in the convolutional network and long-range semantic dependencies between modalities.

2.3 Cross-modal Interaction

With the development of sensor technology, different types of sensors can provide a variety of modal information for semantic segmentation tasks, enabling information interaction [28, 29, 30, 31, 32] between the RGB modality and other modalities. For example, the interaction between RGB and infrared modalities enhances the effectiveness of semantic segmentation in RGB-T scenarios. Xiang et al. [33] used a single-shot polarization sensor to build the first RGB-P dataset, incorporated polarization sensing to obtain supplementary information, and improved the segmentation accuracy for many categories, especially those with polarization characteristics, such as glass. HPGN [34] proposes a novel pyramid graph network targeting features, which is closely connected behind the backbone network to explore multi-scale spatial structural features. GiT [35] proposes a structure in which graphs and transformers interact constantly, enabling close collaboration between global and local features for vehicle re-identification. Zhuang et al. [36] propose a two-stream network (a LiDAR stream and a camera stream) that extracts features from the two modalities separately to realize information interaction between the RGB and LiDAR modalities. These works show that improving semantic segmentation through information interaction between RGB and other modalities is feasible.

3 Methods

3.1 Overview

Fig. 2 depicts the overall structure of the network. The architecture follows an encoder-decoder design, employing skip connections to facilitate information flow between encoding and decoding layers. The encoder comprises a dual-branch convolutional network, with one branch extracting RGB features and the other extracting depth features. We utilize two pre-trained ResNet50 models as the backbone, excluding the final global average pooling layer and fully connected layers. Subsequently, a decoder upsamples the features, progressively restoring the image resolution.

Given an RGB image ${I}_{RGB}\in{\mathbb{R}}^{h\times w\times 3}$ and a depth image ${I}_{Dep}\in{\mathbb{R}}^{h\times w\times 1}$ , a $3\times 3$ convolution is used to extract their shallow features ${\bm{F}}_{RGB}^{0}$ and ${\bm{F}}_{Dep}^{0}$ , which can be expressed as:

{\bm{F}}_{RGB}^{0}={Conv}_{3\times 3}({I}_{RGB})

(3.1)

{\bm{F}}_{Dep}^{0}={Conv}_{3\times 3}({I}_{Dep})

(3.2)

where ${Conv}_{3\times 3}$ denotes $3\times 3$ convolution.

The network mainly consists of a four-layer encoder-decoder and introduces two feature fusion modules: the MIM and the PAM. Each layer of the encoder consists of a ResNetLayer. Passing ${\bm{F}}_{i}^{n-1}$ through the n-th ResNetLayer yields ${\bm{F}}_{i}^{n}$ , so the n-th layer of the encoder can be expressed as:

{\bm{F}}_{i}^{n}=H_{i}^{n}({\bm{F}}_{i}^{n-1})

(3.3)

where ${H}_{i}^{n}$ (n = 1, 2, 3, 4) represents the n-th ResNetLayer, $i\in\{RGB,Depth\}$ denotes the RGB feature or Depth feature. Specifically, the first three multi-level RGB features (ResNetLayer1-ResNetLayer3) and depth features (ResNetLayer1-ResNetLayer3) of the ResNet encoder are fed into the PAM module. Pooled attention weighting operations are performed on the RGB features and depth features separately to obtain $\tilde{\bm{F}}_{RGB}^{n}$ and $\tilde{\bm{F}}_{Dep}^{n}$ , where n = 1, 2, 3. Subsequently, the two features are combined by element-wise addition to obtain $\tilde{\bm{F}}_{Con}^{n}$ , containing rich spatial location information. Furthermore, the final RGB and depth features from the ResNetLayer4 encoder are fed into the MIM module to capture complementary information within these two modalities. The output features of the MIM module are then fed into the decoder, where each upsampling layer consists of two 3 $\times$ 3 convolutional layers. These layers are followed by batch normalization (BN) and ReLU activation, with each upsampling layer doubling the feature spatial dimensions while halving the number of channels.

3.1.1 Pooling Attention Module

Within the low-level features extracted by the convolutional neural network, we capture the fundamental attributes of the input image. These low-level features are critical in modelling the image’s foundational characteristics. However, they lack semantic information from the high-level neural network, such as object shapes and categories. At the same time, during the upsampling process in the decoding layer, there is a risk of losing certain semantic information as the image resolution increases. We introduce the Pooling Attention Module (PAM) to address this issue. The PAM module enhances the representation of these features by using an attention mechanism to focus on critical areas in the low-level feature map. In the decoding layer, we integrate the PAM module’s output with the upsampling layer’s input, effectively compensating for information loss during the upsampling process. This strategy improves the accuracy of segmentation results and efficiently maintains the integrity of semantic information, as shown in Fig. 3.

The input feature ${\bm{F}}_{i}^{n}\in{\mathbb{R}}^{h\times w\times c}$ , where $i\in\{RGB,Depth\}$ denotes the RGB or depth feature, passes through adaptive average pooling, which reduces the feature map to a smaller spatial size:

\bm{A}={H}_{ada}({\bm{F}}_{i}^{n})

(3.4)

where $\bm{A}\in{\mathbb{R}}^{{h}^{\prime}\times{w}^{\prime}\times c}$ represents the feature map resized by adaptive average pooling, and ${H}_{ada}$ denotes the adaptive average pooling operation. ${h}^{\prime}$ and ${w}^{\prime}$ represent the height and width of the output feature map, which we set to ${h}^{\prime}=2$ and ${w}^{\prime}=2$ . We then obtain the output feature ${\bm{A}}^{\prime}$ by max pooling the reduced feature map:

{\bm{A}}^{\prime}={H}_{max}(\bm{A})

(3.5)

where ${\bm{A}}^{\prime}\in{\mathbb{R}}^{1\times 1\times c}$ represents the pooling result and ${H}_{max}$ denotes the max pooling operation. ${\bm{A}}^{\prime}$ then undergoes a 1 $\times$ 1 convolution followed by a sigmoid activation, yielding a weight vector $\bm{V}\in{\mathbb{R}}^{1\times 1\times c}$ whose values lie between 0 and 1. Finally, we take the element-wise product of ${\bm{F}}_{i}^{n}$ and $\bm{V}$ and add the result back to ${\bm{F}}_{i}^{n}$ , so the output $\tilde{\bm{F}}_{i}^{n}$ can be expressed as:

{\bm{V}}=Sigmoid(\varPhi(\bm{A}^{\prime}))

(3.6)

\tilde{\bm{F}}_{i}^{n}={\bm{F}}_{i}^{n}+({\bm{F}}_{i}^{n}\otimes\bm{V})

(3.7)

where $\otimes$ denotes the element-wise product, $\varPhi$ denotes the 1 $\times$ 1 convolution, and $\tilde{\bm{F}}_{i}^{n}$ represents the output feature $\tilde{\bm{F}}_{RGB}^{n}$ or $\tilde{\bm{F}}_{Dep}^{n}$ in Fig. 2. We employ a two-step pooling operation instead of conventional global average pooling. First, the input feature ${\bm{F}}_{i}^{n}$ passes through adaptive average pooling to obtain the intermediate feature $\bm{A}$ with a specified output size. Then, $\bm{A}$ undergoes max pooling to yield the final result $\bm{A}^{\prime}$ . This modification makes the network pay more attention to local regions in the image, such as objects near the background in the scene. Meanwhile, adaptive average pooling enhances the module’s flexibility, accommodating diverse input feature map dimensions and retaining spatial position information in depth features; the visualization results in Fig. 5 show the module’s effectiveness. The final output $\tilde{\bm{F}}_{Con}^{n}$ of the PAM in Fig. 2 is:

\tilde{\bm{F}}_{Con}^{n}=\tilde{\bm{F}}_{RGB}^{n}+\tilde{\bm{F}}_{Dep}^{n}

(3.8)

During the upsampling process, $\tilde{\bm{F}}_{Con}^{n}$ (n = 1, 2, 3) will play a role in the three-level decoder (decoder1-decoder3).
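The PAM computation in Eqs. (3.4)-(3.7) can be traced on a toy feature map. The following is a minimal pure-Python sketch, not the authors' implementation: the function names `adaptive_avg_pool` and `pam`, the nested-list layout `[c][h][w]`, and the identity weights used for the 1 $\times$ 1 convolution are all assumptions made for illustration, and the bin boundaries follow the common floor/ceil convention for adaptive pooling.

```python
import math


def adaptive_avg_pool(channel, out_h, out_w):
    """Average one 2-D map (list of rows) into an out_h x out_w grid of bins."""
    h, w = len(channel), len(channel[0])

    def ceil_div(a, b):
        return -(-a // b)

    pooled = []
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, ceil_div((j + 1) * w, out_w)
            cells = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pam(feature, conv_weight, out_size=2):
    """feature: [c][h][w] nested lists; conv_weight: c x c matrix (assumed 1x1 conv, no bias)."""
    # Eqs. (3.4)-(3.5): adaptive average pooling to out_size x out_size, then max pooling to 1 x 1.
    a_prime = [max(max(row) for row in adaptive_avg_pool(ch, out_size, out_size))
               for ch in feature]
    # Eq. (3.6): on a 1 x 1 map a 1x1 convolution is a matrix-vector product; sigmoid gives V.
    v = [1.0 / (1.0 + math.exp(-sum(wk * a for wk, a in zip(w_row, a_prime))))
         for w_row in conv_weight]
    # Eq. (3.7): residual re-weighting F + F * V, with V broadcast over spatial positions.
    return [[[x + x * v[k] for x in row] for row in ch] for k, ch in enumerate(feature)]


# Toy input: 2 channels, 4 x 4 map, identity 1x1 convolution (hypothetical weights).
feat = [
    [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0], [2.0, 2.0, 3.0, 3.0]],
    [[1.0] * 4 for _ in range(4)],
]
out = pam(feat, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the first channel pools to the 2 $\times$ 2 grid of block means and its channel weight is driven by the largest block, which illustrates how the two-step pooling lets a strong local region set the weight rather than the global mean.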

3.2 Multi-modal Interaction Module

When adjacent objects in an image share similar appearances, distinguishing their categories becomes challenging. Factors such as lighting variations and object occlusion, especially in the corners, can lead to their blending with the background. This complexity makes it difficult to precisely identify object edges, leading to misclassification of the object as part of the background. Depth information remains unaffected by lighting conditions and can accurately differentiate between objects and the background based on depth values. Therefore, we designed the MIM module to supplement RGB information with Depth features. Meanwhile, it utilizes RGB features to strengthen the correlation between RGB and depth features.

The Multi-modal Interaction Module achieves dual-modality feature fusion, as depicted in Fig. 4. Here, ${\bm{F}}_{RGB}^{4}\in{\mathbb{R}}^{h\times w\times c}$ and ${\bm{F}}_{Dep}^{4}\in{\mathbb{R}}^{h\times w\times c}$ correspond to the RGB feature and depth feature from ResNetLayer4. The number of feature channels is denoted as $c$ , and the spatial dimensions are h $\times$ w. First, the two feature maps are linearly mapped to generate multi-head query (Q), key (K), and value (V) vectors, where the subscripts ’rgb’ and ’dep’ indicate the RGB and depth features. These linear mappings are implemented with fully connected layers, and each attention head has its own weight matrix. For each attention head, we calculate the dot product between the Q of one modality and the K of the other, and then normalize the results to a range between 0 and 1 using the softmax function to get the cross-modal attention masks ${\bm{W}}_{rgb}$ and ${\bm{W}}_{dep}$ :

{\bm{W}}_{rgb}=Softmax\left(\frac{{\bm{Q}}_{rgb}{\bm{K}}_{dep}^{T}}{\sqrt{d_{k}}}\right)

(3.9)

{\bm{W}}_{dep}=Softmax\left(\frac{{\bm{Q}}_{dep}{\bm{K}}_{rgb}^{T}}{\sqrt{d_{k}}}\right)

(3.10)

where ${\bm{W}}_{rgb}$ and ${\bm{W}}_{dep}$ represent the RGB attention mask and the depth attention mask, and $d_{k}$ is the dimension of the key vectors. We then calculate the RGB weighted feature $\tilde{\bm{F}}_{RGB}$ and the depth weighted feature $\tilde{\bm{F}}_{Dep}$ , and obtain the final output features $\tilde{\bm{F}}_{RGB}^{4}$ and $\tilde{\bm{F}}_{Dep}^{4}$ through residual connections:

\tilde{\bm{F}}_{RGB}={\bm{W}}_{rgb}\otimes{\bm{V}}_{rgb}

(3.11)

\tilde{\bm{F}}_{RGB}^{4}=\tilde{\bm{F}}_{RGB}+{\bm{F}}_{RGB}^{4}

(3.12)

where $\tilde{\bm{F}}_{RGB}$ represents the RGB weighted feature, obtained by multiplying the value vector ${\bm{V}}_{rgb}$ from the RGB feature with the weight matrix ${\bm{W}}_{rgb}$ , and $\tilde{\bm{F}}_{RGB}^{4}$ represents the RGB feature after fusion with depth. Likewise:

\tilde{\bm{F}}_{Dep}={\bm{W}}_{dep}\otimes{\bm{V}}_{dep}

(3.13)

\tilde{\bm{F}}_{Dep}^{4}=\tilde{\bm{F}}_{Dep}+{\bm{F}}_{Dep}^{4}

(3.14)

where $\tilde{\bm{F}}_{Dep}$ represents the depth weighted feature, obtained by multiplying the value vector ${\bm{V}}_{dep}$ from the depth feature with the weight matrix ${\bm{W}}_{dep}$ , $\tilde{\bm{F}}_{Dep}^{4}$ represents the depth feature after fusion with RGB, and $\otimes$ represents the element-wise product. Finally, we obtain the MIM output through an element-wise sum, which can be formulated as:

\tilde{\bm{F}}_{Con}^{4}=\tilde{\bm{F}}_{RGB}^{4}+\tilde{\bm{F}}_{Dep}^{4}

(3.15)
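The data flow of Eqs. (3.9)-(3.15) can be illustrated with a small pure-Python sketch. It rests on several assumptions that go beyond the text: a single attention head, identity Q/K/V projections in place of the learned fully connected layers, features flattened to a list of N tokens of dimension c, and the mask applied to the values as a matrix product, as in standard scaled dot-product attention (the text writes this step with $\otimes$). The names `mim`, `attention_mask`, and `proj` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_mask(q, k):
    """Softmax(Q K^T / sqrt(d_k)), the form shared by Eqs. (3.9) and (3.10)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mim(f_rgb, f_dep, proj=lambda f: (f, f, f)):
    """f_rgb, f_dep: N x c token matrices; proj returns (Q, K, V), identity by assumption."""
    q_rgb, k_rgb, v_rgb = proj(f_rgb)
    q_dep, k_dep, v_dep = proj(f_dep)
    w_rgb = attention_mask(q_rgb, k_dep)          # Eq. (3.9): RGB queries, depth keys
    w_dep = attention_mask(q_dep, k_rgb)          # Eq. (3.10): depth queries, RGB keys
    out_rgb = add(matmul(w_rgb, v_rgb), f_rgb)    # Eqs. (3.11)-(3.12): weighting + residual
    out_dep = add(matmul(w_dep, v_dep), f_dep)    # Eqs. (3.13)-(3.14)
    return add(out_rgb, out_dep)                  # Eq. (3.15): element-wise sum


# Toy input: two tokens with two channels; an all-zero depth feature gives uniform masks.
f_rgb = [[1.0, 0.0], [0.0, 1.0]]
f_dep = [[0.0, 0.0], [0.0, 0.0]]
fused = mim(f_rgb, f_dep)
```

With an all-zero depth feature every score is zero, so both masks are uniform and the RGB branch simply averages its value vectors before the residual is added. A learned multi-head version would replace `proj` with separate per-modality linear layers.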

3.3 Loss Function

In this paper, the network performs supervised learning on four different levels of decoding features. We employ nearest-neighbor interpolation to reduce the resolution of semantic labels. Additionally, 1 $\times$ 1 convolutions and Softmax functions are utilized to compute the classification probability for each pixel within the output features from the four upsample layers, respectively. The loss function $\mathcal{L}_{i}$ of layer i is the pixel-level cross entropy loss:

\mathcal{L}_{i}=-\frac{1}{{N}_{i}}\displaystyle\sum_{\forall p,q}Y(p,q)\log{({Y}^{\prime}(p,q))}

(3.16)

where ${N}_{i}$ denotes the number of pixels in layer i, $(p,q)$ is the pixel position, ${Y}^{\prime}$ is the classification probability of the output, and $Y$ is the label category. The final loss function $\mathcal{L}$ of the network is obtained by summing the pixel-level loss functions of the four decoding layers:

\mathcal{L}=\displaystyle\sum_{i=1}^{4}\mathcal{L}_{i}

(3.17)

By optimizing the above loss function, the network obtains the final segmentation result after a single training stage.
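The deep supervision scheme of Eqs. (3.16)-(3.17) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: labels are integer class indices (so the one-hot $Y$ selects a single log-probability per pixel), each decoding level is assumed to be smaller than the previous one by an integer factor of 2, and the helper names are hypothetical. The toy example uses two levels instead of four to stay short.

```python
import math


def downsample_nearest(labels, factor):
    """Nearest-neighbour label downsampling by an integer factor (keeps every factor-th pixel)."""
    return [row[::factor] for row in labels[::factor]]


def cross_entropy(probs, labels):
    """Eq. (3.16) with one-hot Y: mean of -log Y'(p, q) taken at the labelled class."""
    losses = [-math.log(probs[p][q][labels[p][q]])
              for p in range(len(labels)) for q in range(len(labels[0]))]
    return sum(losses) / len(losses)


def deep_supervision_loss(level_probs, full_labels):
    """Eq. (3.17): sum of per-level losses; level i is assumed to be 2**i times smaller."""
    total = 0.0
    for i, probs in enumerate(level_probs):
        total += cross_entropy(probs, downsample_nearest(full_labels, 2 ** i))
    return total


# Toy example: a 2 x 2 label map with two classes and two supervision levels.
labels = [[0, 1], [1, 0]]
level0 = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]  # full resolution, [h][w][class]
level1 = [[[0.25, 0.75]]]                                        # half resolution
loss = deep_supervision_loss([level0, level1], labels)
```

Here the full-resolution level contributes $\ln 2$ and the half-resolution level contributes $-\ln 0.25 = 2\ln 2$, so the summed loss is $3\ln 2$.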

4 Experimental Results and Analysis

4.1 Datasets and Evaluation Measures

The NYU-Depth V2 dataset [37] is a widely used indoor scene understanding dataset for computer vision and deep learning research. It is an aggregation of video sequences from various indoor scenes recorded by the RGB-D cameras of the Microsoft Kinect and is an updated version of the NYU-Depth dataset published by Nathan Silberman and Rob Fergus in 2011. It contains 1449 RGB-D image pairs with semantic labels captured in indoor environments. The dataset covers different indoor scenes and scene types, also includes unlabeled frames, and each object can be represented by a class and an instance number.

The SUN RGB-D dataset [38] contains image samples from multiple scenes, covering various indoor scenes such as offices, bedrooms, and living rooms. It has 37 categories and contains 10335 RGB-D images with pixel-level annotations, of which 5285 are used as training images and 5050 are used as test images. The dataset is captured by four different sensors: Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2. In addition, this densely annotated dataset includes 146,617 2D polygons, 64,595 3D bounding boxes with accurate object orientations, and a 3D room layout as well as an image-based scene category.

We evaluate the results using two standard metrics, Pixel Accuracy (Pixel Acc) and Mean Intersection over Union (mIoU).

mIoU: Intersection over Union (IoU) is a standard measure for semantic segmentation. The IoU of a class is the ratio between the intersection and the union of its ground-truth and predicted pixel sets, and mIoU is the average IoU over all classes in the dataset.

mIoU=\dfrac{1}{k+1}\sum_{i=0}^{k}\frac{{p}_{ii}}{\sum_{j=0}^{k}{p}_{ij}+\sum_{j=0}^{k}{p}_{ji}-{p}_{ii}}.

(4.1)

where ${p}_{ij}$ is the number of pixels of class i predicted as class j, ${p}_{ji}$ is the number of pixels of class j predicted as class i, ${p}_{ii}$ is the number of correctly predicted pixels of class i, and k represents the number of categories.

Acc: Pixel accuracy is the simplest metric; it is the proportion of correctly labelled pixels among all pixels.

PA=\dfrac{\sum_{i=0}^{k}{p}_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k}{p}_{ij}}.

(4.2)

where ${p}_{ii}$ is the number of correctly predicted pixels of class i, ${p}_{ij}$ is the number of pixels of class i predicted as class j, and k represents the number of categories.
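Both metrics in Eqs. (4.1) and (4.2) can be computed directly from a confusion matrix. The sketch below is a minimal pure-Python illustration with hypothetical function names; it assumes `conf[i][j]` counts pixels of true class i predicted as class j and that every class appears at least once, so that no IoU denominator is zero.

```python
def mean_iou(conf):
    """Eq. (4.1): average over classes of p_ii / (sum_j p_ij + sum_j p_ji - p_ii)."""
    n = len(conf)
    ious = []
    for i in range(n):
        row = sum(conf[i])                        # sum_j p_ij: pixels whose true class is i
        col = sum(conf[j][i] for j in range(n))   # sum_j p_ji: pixels predicted as class i
        ious.append(conf[i][i] / (row + col - conf[i][i]))
    return sum(ious) / n


def pixel_accuracy(conf):
    """Eq. (4.2): correctly labelled pixels divided by all pixels."""
    total = sum(sum(row) for row in conf)
    return sum(conf[i][i] for i in range(len(conf))) / total


# Toy two-class confusion matrix (hypothetical counts).
conf = [[3, 1],
        [2, 4]]
miou = mean_iou(conf)
acc = pixel_accuracy(conf)
```

For this matrix the per-class IoUs are 3/6 and 4/7, giving an mIoU of about 0.536, while the pixel accuracy is 7/10. The example shows how a class with many false positives lowers mIoU more than it lowers pixel accuracy.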

4.2 Implementation Details

We implemented and trained our proposed network model using the PyTorch framework. To enhance the diversity of the training data, we applied random scaling and mirroring. Subsequently, all RGB and depth images were resized to $480\times 480$ for network inputs, and semantic labels were resized to $480\times 480$ , $240\times 240$ , $120\times 120$ , and $60\times 60$ for deep supervision training. As the backbone for our encoder, we utilized a ResNet50 [39] pre-trained on the ImageNet classification dataset [40]. To refine the network structure, following [41, 42, 43], we replaced the $7\times 7$ convolution in the input stem with three consecutive $3\times 3$ convolutions. Training was conducted on an NVIDIA GeForce RTX 3090 GPU using stochastic gradient descent. We used a batch size of 6, an initial learning rate of 0.003, 500 epochs, a momentum of 0.9, and a weight decay of 0.0005.
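For reference, the training settings listed above can be collected in one place. The following is only an illustrative configuration sketch: the class name `TrainConfig`, its field names, and the `label_sizes` helper are hypothetical and do not come from the authors' code. The helper assumes each supervision level halves the label resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the implementation details (field names are hypothetical)."""
    input_size: int = 480          # RGB and depth images are resized to 480 x 480
    batch_size: int = 6
    learning_rate: float = 0.003   # initial learning rate for SGD
    epochs: int = 500
    momentum: float = 0.9
    weight_decay: float = 0.0005
    supervision_levels: int = 4    # number of decoder outputs that receive a loss

    def label_sizes(self):
        """Label side lengths for deep supervision, halving at each level."""
        return [self.input_size // 2 ** i for i in range(self.supervision_levels)]


config = TrainConfig()
sizes = config.label_sizes()
```

With the default values, `sizes` lists the four label resolutions used for deep supervision, from 480 down to 60.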

4.3 Quantitative Results on NYUv2 and SUN RGB-D

Firstly, we compare the proposed method against existing approaches using the NYUv2 dataset.

Table 1: MIPANet compared to the state-of-the-art methods on the NYUDv2 dataset.

Model	Method	Backbone	mIoU(%)	Pix.Acc(%)
ResNet34	IEMNet[44]	Res34NBt1D	51.3	76.8
ResNet18	ESANet[16]	2 $\times$ R18	48.2	-
ResNet50	RDFNet[7]	2 $\times$ R50	47.7	74.8
ResNet50	ACNet[15]	3 $\times$ R50	48.3	-
ResNet50	SA-Gate[45]	2 $\times$ R50	50.4	-
ResNet50	ESANet	2 $\times$ R50	50.5	-
ResNet50	DynMM[46]	R50	51.0	-
ResNet50	RedNet[8]	2 $\times$ R50	47.2	-
ResNet101	SGNet[47]	R101	49.6	75.6
ResNet101	RDFNet	2 $\times$ R101	49.1	75.6
ResNet101	ShapeConv[48]	R101	51.3	76.4
ResNet50	Baseline	2 $\times$ R50	47.4	75.1
ResNet50	Ours(MIPA)	2 $\times$ R50	51.9 (+ 4.5%)	77.2 (+ 2.1%)

Table 2: MIPANet compared to the state-of-the-art methods on the SUN RGB-D dataset.

Model	Method	Backbone	mIoU(%)	Pix.Acc(%)
ResNet34	EMSANet[49]	2 $\times$ R34	48.5	-
ResNet34	IEMNet[44]	Res34NBt1D	48.3	81.9
ResNet50	ACNet[15]	3 $\times$ R50	48.1	-
ResNet50	ESANet[16]	2 $\times$ R50	48.3	-
ResNet50	RedNet[8]	2 $\times$ R50	47.8	81.3
ResNet101	SGNet[47]	R101	47.1	81.0
ResNet101	CANet[50]	R101	48.3	82.0
ResNet101	CGBNet[51]	R101	48.2	82.3
ResNet101	ShapeConv[48]	R101	48.6	82.2
ResNet152	RDFNet[7]	2 $\times$ R152	47.7	81.5
ResNet50	Baseline	2 $\times$ R50	45.5	81.1
ResNet50	Ours(MIPA)	2 $\times$ R50	48.8 (+ 3.3%)	82.3 (+ 1.2%)

Table 1 shows that our method achieves superior mIoU and Acc compared to other methods. Specifically, with ResNet50 serving as the encoder in our network, the pixel accuracy and mean intersection-over-union (mIoU) for semantic segmentation on the NYUv2 test set reach 77.2% and 51.9%, respectively. For example, compared with RDFNet, which also employs ResNet50, our approach improves accuracy (Acc) by 2.4% and mIoU by 4.2%. This underscores a significant enhancement in segmentation accuracy achieved by our MIPANet with the identical ResNet50 architecture. Compared to SGNet, which utilizes ResNet101, our model demonstrates an improvement of 1.6% and 2.3% in Acc and mIoU, respectively. Notably, our ResNet50-based model outperforms the ResNet101-based models, showcasing the effectiveness of our carefully designed network structure and the multi-modal feature fusion module. These improvements in segmentation results are achieved without the need for more complex networks, leading to reduced training time. Here, “R” represents ResNet, and the symbol ’-’ signifies that the corresponding method did not report that metric. We further compared different network structures across various methods, noting that ESANet incorporates two ResNet18s as the backbone, while ACNet utilizes three ResNet50s as the backbone.

Then, we compared our proposed algorithm with existing methods on the SUN RGB-D dataset. As shown in Table 2, our approach achieves a higher mIoU on the SUN RGB-D dataset than all other methods. For instance, MIPANet outperforms SGNet, exhibiting an improvement of 1.3% and 1.7% in Acc and mIoU, respectively. This observation underscores our modules’ ability to maintain superior segmentation accuracy even on the larger SUN RGB-D dataset. Across backbone architectures, ResNet101 generally performs better than ResNet50, which in turn outperforms ResNet18. We opted for ResNet50 as our backbone to achieve commendable performance with less training time than ResNet101. Notably, compared to the baseline, our method increases mIoU and Acc by 4.5% and 2.1% on NYUv2 and by 3.3% and 1.2% on SUN RGB-D, as indicated in parentheses in the tables.

4.4 Visualization results on NYUv2

To visually highlight the advancements made by our method in the realm of RGB-D semantic segmentation, we provide visualization results of the network on the NYUv2 dataset.

Compared to the baseline, our method significantly improves the segmentation results. Notably, the dashed boxes in the figure show that our network, enriched with depth information, accurately distinguishes objects from the background. For instance, in the fourth image, the baseline erroneously categorizes the mirror on the wall as part of the background, and in the second image, ACNet and ESANet mistake the carpet for part of the floor. In contrast, by leveraging depth information, our network discerns the distinct distance of the mirror from the background, leading to a correct classification of the mirror. Fig. 5 illustrates the visualization results of the proposed algorithm on the NYUv2 dataset. From left to right, the columns depict the RGB image, the depth image, the baseline model results with the ResNet50 backbone, ACNet, ESANet, MIPANet (Ours), and the ground truth. The algorithm presented in this paper achieves precise segmentation outcomes in diverse and intricate indoor scenes. Moreover, it excels in segmenting challenging objects like “carpets” and “books” while delivering finer edge segmentation results.

4.5 Ablation Study on PAM and MIM on NYUv2

We conducted ablation experiments comparing PAM and MIM configurations on the NYUv2 dataset, as shown in Fig. 6.

Specifically, the RGB and depth features are fed into the PAM to obtain $\tilde{\bm{F}}_{RGB}^{n}$ and $\tilde{\bm{F}}_{Dep}^{n}$ . Given the modality differences, we examined whether the PAM parameters should be shared between the two branches. Moreover, considering the impact of network depth on information interaction, we also applied the MIM in both Layer 3 and Layer 4 of the encoder. Fig. 6 presents the results of ablation studies on PAM and MIM using different configurations (B1-B4) on the NYUv2 dataset: B1 (PAM without shared parameters and MIM used on ResNetLayer4), B2 (PAM with shared parameters and MIM used on ResNetLayer4), B3 (PAM without shared parameters and MIM used on ResNetLayer3 and ResNetLayer4), and B4 (PAM with shared parameters and MIM used on ResNetLayer3 and ResNetLayer4). The best results are obtained with PAM without shared parameters in ResNetLayer1-3 and MIM only in the last layer of the encoder, giving the highest mIoU of 51.9%.

4.6 Ablation Study on NYUv2 and SUN-RGBD

To investigate the impact of different modules on segmentation performance, we conducted ablation experiments on the NYUv2 and SUN-RGBD datasets, as reported in Table 3. '✓' indicates that a module is used, while '✗' indicates that it is not. On NYUv2, PAM alone improves on the baseline by 1.5% in mIoU and 0.9% in Acc. Similarly, MIM alone improves on the baseline by 3.7% in mIoU and 1.9% in Acc. These results suggest that each proposed module can independently enhance segmentation accuracy. Our modules also surpass the baseline in fusing cross-modal features, yielding better results on both datasets. Using PAM and MIM together, we achieve the highest mIoU of 51.9% on the NYUv2 dataset and the highest mIoU of 48.8% on the SUN-RGBD dataset. This result shows that the two modules complement each other and can be jointly optimized to enhance segmentation accuracy.

Table 3: Ablation studies of PAM and MIM on the NYUv2 and SUN-RGBD datasets

Method	PAM	MIM	NYUv2		SUN-RGBD	
			mIoU(%)	Acc(%)	mIoU(%)	Acc(%)
Baseline	✗	✗	47.4	75.1	45.5	81.1
Ours	✓	✗	48.9	76.0	47.9	81.3
	✗	✓	51.1	77.0	48.3	81.5
	✓	✓	51.9	77.2	48.8	82.3
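
The gains quoted in the text can be recomputed directly from Table 3. The short sketch below does this arithmetic; the scores are copied from the table, while the dictionary layout, the row labels, and the function name `gains_over_baseline` are illustrative assumptions rather than part of the original work.

```python
# Scores copied from Table 3 as (mIoU %, Acc %) per dataset.
TABLE3 = {
    "Baseline":  {"NYUv2": (47.4, 75.1), "SUN-RGBD": (45.5, 81.1)},
    "PAM":       {"NYUv2": (48.9, 76.0), "SUN-RGBD": (47.9, 81.3)},
    "MIM":       {"NYUv2": (51.1, 77.0), "SUN-RGBD": (48.3, 81.5)},
    "PAM + MIM": {"NYUv2": (51.9, 77.2), "SUN-RGBD": (48.8, 82.3)},
}


def gains_over_baseline(table, baseline="Baseline"):
    """Return (mIoU gain, Acc gain) in percentage points for each non-baseline row."""
    gains = {}
    for method, scores in table.items():
        if method == baseline:
            continue
        gains[method] = {
            dataset: (
                round(miou - table[baseline][dataset][0], 1),
                round(acc - table[baseline][dataset][1], 1),
            )
            for dataset, (miou, acc) in scores.items()
        }
    return gains


if __name__ == "__main__":
    for method, per_dataset in gains_over_baseline(TABLE3).items():
        print(method, per_dataset)
```

On NYUv2 this gives gains of (1.5, 0.9) for PAM, (3.7, 1.9) for MIM, and (4.5, 2.1) for both modules combined, matching the figures stated above.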

5 Conclusions

In this paper, we tackle a fundamental challenge in RGB-D semantic segmentation: efficiently fusing features from two distinct modalities. We designed a Multi-modal Interaction and Pooling Attention Network, which uses a small and flexible PAM in the shallow layers of the network to enhance its feature extraction capability and a MIM in the last layer of the encoder to integrate RGB and depth features effectively. By exploiting the complementary information between the RGB and depth modalities, the network improves the accuracy of semantic segmentation in indoor scenes. In future work, we will extend our method to enhance its generalization ability in RGB-D semantic segmentation. Furthermore, we anticipate performance improvements from integrating tasks such as depth estimation into the existing framework, facilitating collaborative network interactions.

Limitations. The effectiveness of our method has been validated only on CNN backbones; we have not yet verified it on other network architectures, such as Transformers. In addition, the network requires both RGB and depth images as input at test time, which limits its generalization ability. Consequently, the network may not achieve optimal segmentation results on datasets lacking depth information.

Conflict of interest

The authors declare there is no conflict of interest.

References

[1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[2] M. Li, M. Wei, X. He, F. Shen, Enhancing part features via contrastive attention module for vehicle re-identification, in: 2022 IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 1816–1820.
[3] Z. Zhang, Microsoft kinect sensor and its effect, IEEE multimedia 19 (2) (2012) 4–10.
[4] Y. He, W.-C. Chiu, M. Keuper, M. Fritz, Std2p: Rgbd semantic segmentation using spatio-temporal data-driven pooling, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7158–7167.
[5] C. Couprie, C. Farabet, L. Najman, Y. LeCun, Indoor semantic segmentation using depth information, arXiv preprint arXiv:1301.3572 (2013).
[6] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from rgb-d images for object detection and segmentation, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13, Springer, 2014, pp. 345–360.
[7] S.-J. Park, K.-S. Hong, S. Lee, Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 4990–4999.
[8] J. Jiang, L. Zheng, F. Luo, Z. Zhang, Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation, arXiv preprint arXiv:1806.01054 (2018).
[9] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650–2658.
[10] A. Wang, J. Lu, G. Wang, J. Cai, T.-J. Cham, Multi-modal unsupervised feature learning for rgb-d scene labeling, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 453–467.
[11] F. Liu, C. Shen, G. Lin, Deep convolutional neural fields for depth estimation from a single image, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5162–5170.
[12] J. Hu, Z. Huang, F. Shen, D. He, Q. Xian, A bag of tricks for fine-grained roof extraction, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023.
[13] J. Hu, Z. Huang, F. Shen, D. He, Q. Xian, A robust method for roof extraction and height estimation, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023.
[14] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture, in: Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, Springer, 2017, pp. 213–228.
[15] X. Hu, K. Yang, L. Fei, K. Wang, Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 1440–1444.
[16] D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, H.-M. Gross, Efficient rgb-d semantic segmentation for indoor scene analysis, in: 2021 IEEE international conference on robotics and automation (ICRA), IEEE, 2021, pp. 13525–13531.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[18] F. Shen, M. Wei, J. Ren, Hsgnet: Object re-identification with hierarchical similarity graph network, arXiv preprint arXiv:2211.05486 (2022).
[19] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
[20] F. Shen, J. Zhu, X. Zhu, J. Huang, H. Zeng, Z. Lei, C. Cai, An efficient multiresolution network for vehicle reidentification, IEEE Internet of Things Journal 9 (11) (2022) 9049–9059.
[21] F. Shen, X. Peng, L. Wang, X. Zhang, M. Shu, Y. Wang, Hsgm: A hierarchical similarity graph module for object re-identification, in: 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022, pp. 1–6.
[22] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
[23] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, X. Tang, Residual attention network for image classification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3156–3164.
[24] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[25] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, Eca-net: Efficient channel attention for deep convolutional neural networks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11531–11539.
[26] C. Qiao, F. Shen, X. Wang, R. Wang, F. Cao, S. Zhao, C. Li, A novel multi-frequency coordinated module for sar ship detection, in: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2022, pp. 804–811.
[27] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, W. Liu, Ccnet: Criss-cross attention for semantic segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 603–612.
[28] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, T. Harada, Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 5108–5115.
[29] F. Shen, X. Du, L. Zhang, J. Tang, Triplet contrastive learning for unsupervised vehicle re-identification, arXiv preprint arXiv:2301.09498 (2023).
[30] Q. Zhang, S. Zhao, Y. Luo, D. Zhang, N. Huang, J. Han, Abmdrnet: Adaptive-weighted bi-directional modality difference reduction network for rgb-t semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2633–2642.
[31] F. Shen, X. Shu, X. Du, J. Tang, Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval, in: Proceedings of the 31st ACM International Conference on Multimedia, 2023.
[32] Y. Sun, W. Zuo, M. Liu, Rtfnet: Rgb-thermal fusion network for semantic segmentation of urban scenes, IEEE Robotics and Automation Letters 4 (3) (2019) 2576–2583.
[33] K. Xiang, K. Yang, K. Wang, Polarization-driven semantic segmentation via efficient attention-bridged fusion, Optics Express 29 (4) (2021) 4802–4820.
[34] F. Shen, J. Zhu, X. Zhu, Y. Xie, J. Huang, Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification, IEEE Transactions on Intelligent Transportation Systems 23 (7) (2022) 8793–8804.
[35] F. Shen, Y. Xie, J. Zhu, X. Zhu, H. Zeng, Git: Graph interactive transformer for vehicle re-identification, IEEE Transactions on Image Processing (2023).
[36] Z. Zhuang, R. Li, K. Jia, Q. Wang, Y. Li, M. Tan, Perception-aware multi-sensor fusion for 3d lidar semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16260–16270.
[37] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, in: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, Springer, 2012, pp. 746–760.
[38] S. Song, S. P. Lichtenberg, J. Xiao, Sun rgb-d: A rgb-d scene understanding benchmark suite, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
[41] X. Fu, F. Shen, X. Du, Z. Li, Bag of tricks for “vision meet alage” object detection challenge, in: 2022 6th International Conference on Universal Village (UV), IEEE, 2022, pp. 1–4.
[42] F. Shen, X. He, M. Wei, Y. Xie, A competitive method to vipriors object detection challenge, arXiv preprint arXiv:2104.09059 (2021).
[43] F. Shen, Z. Wang, Z. Wang, X. Fu, J. Chen, X. Du, J. Tang, A competitive method for dog nose-print re-identification, arXiv preprint arXiv:2205.15934 (2022).
[44] X. Xu, J. Liu, H. Liu, Interactive efficient multi-task network for rgb-d semantic segmentation, Electronics 12 (18) (2023) 3943.
[45] X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, G. Zeng, Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation, in: European Conference on Computer Vision, Springer, 2020, pp. 561–577.
[46] Z. Xue, R. Marculescu, Dynamic multimodal fusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2575–2584.
[47] L.-Z. Chen, Z. Lin, Z. Wang, Y.-L. Yang, M.-M. Cheng, Spatial information guided convolution for real-time rgbd semantic segmentation, IEEE Transactions on Image Processing 30 (2021) 2313–2324.
[48] J. Cao, H. Leng, D. Lischinski, D. Cohen-Or, C. Tu, Y. Li, Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7068–7077.
[49] D. Seichter, S. B. Fischedick, M. Köhler, H.-M. Groß, Efficient multi-task rgb-d scene analysis for indoor environments, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–10.
[50] Q. Tang, F. Liu, T. Zhang, J. Jiang, Y. Zhang, Attention-guided chained context aggregation for semantic segmentation, Image and Vision Computing 115 (2021) 104309.
[51] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, G. Wang, Semantic segmentation with context encoding and multi-path decoding, IEEE Transactions on Image Processing 29 (2020) 3520–3533.