License: arXiv.org perpetual non-exclusive license
arXiv:2311.11312v2 [cs.CV] 06 Dec 2023
Corresponding author: minghongxie@163.com

MIPANet: Optimizing RGB-D Semantic Segmentation through Multi-modal Interaction and Pooling Attention

Shuai Zhang (1), Minghong Xie (1, corresponding author)
(1) Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500 Yunnan, PR China

Abstract

Semantic segmentation of RGB-D images requires understanding both the appearance and the spatial relationships of objects within a scene. However, in indoor environments, simply feeding RGB and depth images into a network often yields limited semantic and spatial information, leading to suboptimal segmentation outcomes. To address this, we propose the Multi-modal Interaction and Pooling Attention Network (MIPANet), designed to harness the interaction between the RGB and depth modalities and make better use of their complementary information. Specifically, we incorporate a Multi-modal Interaction Module (MIM) into the deepest layers of the network. This module fuses RGB and depth information, allowing the two modalities to enhance and correct each other. Additionally, we introduce a Pooling Attention Module (PAM) at various stages of the encoder to enhance the features extracted by the network. The outputs of the PAMs are selectively integrated into the decoder to improve semantic segmentation performance. Experimental results demonstrate that MIPANet outperforms existing methods on two indoor scene datasets, NYUDv2 and SUN-RGBD, by addressing the insufficient information interaction between modalities in RGB-D semantic segmentation.

Keywords: RGB-D Semantic Segmentation; Attention Mechanism; Multi-modal Interaction

1 Introduction

In recent years, Convolutional Neural Networks (CNNs) have been widely used in image semantic segmentation, and high-performance models have gradually replaced traditional semantic segmentation methods. Since the introduction of Fully Convolutional Networks (FCN) [1, 2], which show great potential in semantic segmentation tasks, many researchers have proposed improved semantic segmentation models based on this design. Nevertheless, semantic segmentation remains a formidable challenge in some indoor environments, given complications such as variations in illumination and mutual occlusion between objects.

With the widespread application of depth sensors and depth cameras [3], image research is no longer limited to RGB color images but extends to RGB-Depth (RGB-D) images, which contain depth information. RGB images provide appearance information such as the color and texture of objects. In contrast, depth images provide three-dimensional geometric information about objects, which is missing in RGB images and is valuable for indoor scenes. References [4, 5] simply concatenate RGB features and depth features to form a four-channel input, improving the accuracy of semantic segmentation. Reference [6] converts depth images into three distinct channels (horizontal disparity, height above ground, and angle of surface normals) to obtain the HHA image, then feeds the RGB features and HHA features into two parallel CNNs that each predict a segmentation probability map, and fuses the two maps in the last layer of the network as the final segmentation result. Although the above methods have achieved good results in RGB-D semantic segmentation, most RGB-D semantic segmentation methods [7, 8, 9, 10] simply merge RGB features and depth features by concatenation or summation. As a result, the information differences between the modalities are not resolved effectively, so the CNN cannot fully use the complementary information between them, resulting in confusion between objects and background. For example, the printer and trash bin in Fig. 1 (a) are prone to being incorrectly merged into the background.

Figure 1: Leveraging depth features within our MIPANet improves segmentation accuracy. The prediction result accurately distinguishes the trash can and printer from the background.

To solve the above problems, we propose MIPANet, an RGB-D semantic segmentation network for indoor scenes. Fig. 2 illustrates the overall structure of the network. The network is an encoder-decoder architecture that integrates two innovative feature fusion modules: the Multi-modal Interaction Module (MIM) and the Pooling Attention Module (PAM). The encoder is composed of two identical CNN branches, one extracting RGB features and the other extracting depth features. In this study, RGB and depth features are extracted and fused incrementally across various network levels, optimizing semantic segmentation results by exploiting spatial disparities and semantic interdependencies among multimodal features. In the PAM, we use adaptive average pooling instead of global average pooling, which not only allows flexible adjustment of the output size but also preserves more spatial information, facilitating enhanced extraction of depth features. In the MIM, we obtain two sets of Q, K, V for the different modalities and perform calculations using the Q, K from one set and V from the other. This achieves information interaction between the RGB and depth modalities. This paper's main contributions can be summarized as follows:

• We introduce an end-to-end multi-modal fusion network, MIPANet, incorporating multi-modal interaction and pooling attention. This approach optimizes the integration of complementary information from RGB and depth features, effectively tackling the challenge posed by insufficient cross-modal feature fusion in RGB-D semantic segmentation.

• We present two cross-modal feature fusion methods. Within the MIM, a cross-modal feature interaction and fusion mechanism was developed. RGB and depth features are collaboratively optimized using attention masks to extract partially detailed features. In addition, the PAM integrates intermediate layer features into the decoder, enhancing feature extraction and supporting the decoder in upsampling and recovery.

• Experimental results confirm the effectiveness of our proposed RGB-D semantic segmentation network in accurately handling indoor images in complex scenarios. The model demonstrated superior semantic segmentation performance compared to other methods on the publicly available NYUDv2 and SUN RGB-D datasets.

2 Related Work

In this section, we provide a comprehensive review of three parts: (1) RGB-D Semantic Segmentation, (2) Attention Mechanism, and (3) Cross-modal Interaction.

2.1 RGB-D Semantic Segmentation

With the widespread application of depth sensors and depth cameras in the field of depth estimation [11, 9, 12, 13], the depth information of a scene can be obtained more conveniently, and image research is no longer limited to a single RGB image. The RGB-D semantic segmentation task aims to efficiently integrate RGB features and depth features to improve segmentation accuracy, especially in indoor scenes. Couprie et al. [4] proposed an early fusion approach, which simply concatenates an image's RGB and depth channels as a four-channel input to the convolutional neural network. Wang et al. [6] separately input RGB features and HHA features into two CNNs for prediction and perform fusion in the final stage of the network. Reference [14] introduced an encoder-decoder network, employing a dual-branch encoder to extract features separately from RGB images and depth images. The studies mentioned above employed equal-weight concatenation or summation operations to fuse RGB and depth features without fully leveraging the complementary information between different modalities. In recent years, some research has proposed more effective strategies for RGB-D feature fusion. Hu et al. [15] utilised a three-branch encoder that includes RGB, Depth, and Fusion branches, efficiently collecting features without breaking the original RGB and depth inference branches. Seichter et al. [16] presented an efficient RGB-D segmentation approach, characterised by two enhanced ResNet-based encoders utilising an attention-based fusion for incorporating depth information. However, these methods did not fully exploit the differential information between the two modalities and the intermediate-level features extracted by the convolutional network.

2.2 Attention Mechanism

In recent years, attention [17, 18, 19, 20, 21, 22] has been widely used in computer vision and other fields. Vaswani et al. [17] proposed the self-attention mechanism, which has had a profound impact on the design of deep learning models. Fu et al. [19] proposed DANet, which can adaptively integrate local features and their global dependencies. Wang et al. [23] utilised spatial attention in an image classification model. Through the backpropagation of a convolutional neural network, they adaptively learned spatial attention masks, allowing the model to focus on the significant regions of the image. SENet [24] proposed channel attention, which adaptively learns the importance of each feature channel through a neural network. Woo et al. [22] incorporate two attention modules that concurrently capture channel-wise and spatial relationships. ECA-Net [25] introduces a straightforward and efficient "local" channel attention mechanism to minimize computational overhead. MFC [26] introduced a multi-frequency domain attention module to capture information across different frequency domains. Similarly, CAMNet [2] proposed a contrastive attention module designed to amplify local saliency. Building upon this foundation, Huang et al. [27] proposed a cross-attention module that consolidates contextual information both horizontally and vertically and can thereby gather contextual information from all pixels. These methods have demonstrated significant potential in single-modality feature extraction. To effectively leverage the complementary information between different modalities, this paper introduces a Pooling Attention Module that learns the differential information between two distinct modalities and fully exploits the intermediate-level features in the convolutional network and long-range semantic dependencies between modalities.

2.3 Cross-modal Interaction

With the development of sensor technology, different types of sensors can provide a variety of modal information for semantic segmentation tasks, enabling information interaction [28, 29, 30, 31, 32] between the RGB modality and other modalities. The interaction between RGB and infrared modalities has enhanced the effectiveness of semantic segmentation in RGB-T scenarios. Xiang et al. [33] used a single-shot polarization sensor to build the first RGB-P dataset, incorporated polarization sensing to obtain supplementary information, and improved segmentation accuracy for many categories, especially those with polarization characteristics, such as glass. HPGN [34] proposes a novel pyramid graph network that is attached directly behind the backbone network to explore multi-scale spatial structural features. GiT [35] proposes a structure where graphs and transformers interact constantly, enabling close collaboration between global and local features for vehicle re-identification. Zhuang et al. [36] propose a two-stream network (a LiDAR stream and a camera stream) that extracts features from each modality to realize information interaction between the RGB and LiDAR modalities. These works indicate that improving semantic segmentation through information interaction between the RGB modality and other modalities is feasible.

3 Methods

3.1 Overview

Fig. 2 depicts the overall structure of the network. The architecture follows an encoder-decoder design, employing skip connections to facilitate information flow between encoding and decoding layers. The encoder comprises a dual-branch convolutional network, with one branch extracting RGB features and the other extracting depth features. We utilize two pre-trained ResNet50 models as the backbone, excluding the final global average pooling layer and fully connected layers. Subsequently, a decoder is employed to upsample the features, progressively restoring image resolution.

Figure 2: Multi-modal Interaction and Pooling Attention (MIPA) network architecture. Each PAM at different network levels generates two weight-unshared features: RGB features denoted as $\tilde{\bm{F}}_{RGB}^{n}$ and depth features denoted as $\tilde{\bm{F}}_{Dep}^{n}$. Following an element-wise sum, we obtain $\tilde{\bm{F}}_{Con}^{n}$, where $n$ denotes the network level. MIM receives RGB and depth features from ResNetLayer4 and integrates the fusion result $\bm{F}_{Con}^{4}$ into the decoder.

Given an RGB image $I_{RGB} \in \mathbb{R}^{h \times w \times 3}$ and a depth image $I_{Dep} \in \mathbb{R}^{h \times w \times 1}$, a $3 \times 3$ convolution is used to extract their shallow features $\bm{F}_{RGB}^{0}$ and $\bm{F}_{Dep}^{0}$, which can be expressed as:

$$\bm{F}_{RGB}^{0} = \mathrm{Conv}_{3 \times 3}(I_{RGB}) \tag{3.1}$$
$$\bm{F}_{Dep}^{0} = \mathrm{Conv}_{3 \times 3}(I_{Dep}) \tag{3.2}$$

where $\mathrm{Conv}_{3 \times 3}$ denotes a $3 \times 3$ convolution.
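As a rough illustration of Eqs. (3.1) and (3.2), the sketch below applies a naive $3 \times 3$ convolution to a three-channel RGB input and a one-channel depth input. The stride (1), zero padding (1), number of output channels, and random kernels are all assumptions chosen for the example; the text does not specify them.

```python
import numpy as np

def conv3x3(image, kernels):
    """Naive 3x3 convolution (cross-correlation) with stride 1 and zero padding 1.

    image:   (h, w, c_in) array
    kernels: (3, 3, c_in, c_out) array
    returns: (h, w, c_out) array
    """
    h, w, _ = image.shape
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernels.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input, mixed across channels by the
            # (c_in, c_out) kernel slice at offset (dy, dx).
            out += padded[dy:dy + h, dx:dx + w, :] @ kernels[dy, dx]
    return out

rng = np.random.default_rng(0)
h, w, c_out = 8, 10, 16                      # toy sizes (assumed)
rgb = rng.standard_normal((h, w, 3))         # I_RGB
dep = rng.standard_normal((h, w, 1))         # I_Dep

# Each branch has its own kernels; only the input channel count differs.
f_rgb0 = conv3x3(rgb, rng.standard_normal((3, 3, 3, c_out)))  # F_RGB^0
f_dep0 = conv3x3(dep, rng.standard_normal((3, 3, 1, c_out)))  # F_Dep^0
print(f_rgb0.shape, f_dep0.shape)
```

Under these assumptions both branches produce feature maps of the same shape, which is what allows the later element-wise fusion of the two modalities.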

The network mainly consists of a four-layer encoder-decoder and introduces two feature fusion modules: the MIM and the PAM. Each layer of the encoder consists of a ResNetLayer. Passing $\bm{F}_{i}^{0}$ through the ResNetLayers yields $\bm{F}_{i}^{n}$; the $n$-th layer of the encoder can be expressed as:

$$\bm{F}_{i}^{n} = H_{i}^{n}(\bm{F}_{i}^{n-1}) \tag{3.3}$$

where $H_{i}^{n}$ ($n = 1, 2, 3, 4$) represents the $n$-th ResNetLayer and $i \in \{RGB, Depth\}$ denotes the RGB feature or depth feature. Specifically, the first three multi-level RGB features and depth features of the ResNet encoder (ResNetLayer1-ResNetLayer3) are fed into the PAM. Pooling attention weighting operations are performed on the RGB features and depth features separately to obtain $\tilde{\bm{F}}_{RGB}^{n}$ and $\tilde{\bm{F}}_{Dep}^{n}$, where $n = 1, 2, 3$. Subsequently, the two features are combined by element-wise addition to obtain $\tilde{\bm{F}}_{Con}^{n}$, which contains rich spatial location information. Furthermore, the final RGB and depth features from ResNetLayer4 are fed into the MIM to capture complementary information within the two modalities. The output features of the MIM are then fed into the decoder, where each upsampling layer consists of two $3 \times 3$ convolutional layers followed by batch normalization (BN) and ReLU activation. Each upsampling layer doubles the spatial dimensions of the features while halving the number of channels.
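To make the "double the spatial size, halve the channels" rule concrete, the following sketch traces feature shapes through the decoder. The starting shape of 15 x 20 x 2048 assumes a 480 x 640 input and the standard ResNet50 output stride of 32 with 2048 channels at Layer4, and the count of three upsampling layers is likewise an assumption made for illustration.

```python
def decoder_shapes(h, w, c, num_layers=3):
    """Trace (height, width, channels) through successive upsampling layers.

    Each layer doubles the spatial dimensions and halves the channel count,
    as described in the text. num_layers is an illustrative assumption.
    """
    shapes = [(h, w, c)]
    for _ in range(num_layers):
        h, w, c = 2 * h, 2 * w, c // 2
        shapes.append((h, w, c))
    return shapes

# Assumed MIM output shape for a 480 x 640 input (stride 32, 2048 channels).
for level, shape in enumerate(decoder_shapes(15, 20, 2048)):
    print(f"after {level} upsampling layer(s): {shape}")
```

With these assumed sizes, the intermediate shapes have 1024, 512, and 256 channels, which matches the channel counts of a standard ResNet50 at Layer3, Layer2, and Layer1. That would be consistent with merging the PAM outputs $\tilde{\bm{F}}_{Con}^{n}$ into the decoder level by level.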

3.1.1 Pooling Attention Module

Within the low-level features extracted by the convolutional neural network, we capture the fundamental attributes of the input image. These low-level features are critical in modelling the image’s foundational characteristics. However, they lack semantic information from the high-level neural network, such as object shapes and categories. At the same time, during the upsampling process in the decoding layer, there is a risk of losing certain semantic information as the image resolution increases. We introduce the Pooling Attention Module (PAM) to address this issue. The PAM module enhances the representation of these features by using an attention mechanism to focus on critical areas in the low-level feature map. In the decoding layer, we integrate the PAM module’s output with the upsampling layer’s input, effectively compensating for information loss during the upsampling process. This strategy improves the accuracy of segmentation results and efficiently maintains the integrity of semantic information, as shown in Fig. 3.

Figure 3: The details of the Pooling Attention Module. After a two-step pooling operation, we obtain the pooling result $\bm{A}'$. Subsequently, a $1 \times 1$ convolution and a sigmoid activation function constrain the values of the weight vector $\bm{V}$ (e.g., yellow) to between 0 and 1. The output feature $\tilde{\bm{F}}_{i}^{n}$ is obtained by taking the weighted sum of the input feature $\bm{F}_{i}^{n}$.

The input feature $\bm{F}_{i}^{n} \in \mathbb{R}^{h \times w \times c}$, where $i \in \{RGB, Depth\}$ denotes the RGB feature or depth feature, passes through adaptive average pooling to reduce the feature map to a smaller dimension:

$$\bm{A} = H_{ada}(\bm{F}_{i}^{n}) \tag{3.4}$$

where $\bm{A} \in \mathbb{R}^{h' \times w' \times c}$ represents the feature map resized by adaptive average pooling and $H_{ada}$ denotes the adaptive average pooling operation. $h'$ and $w'$ represent the height and width of the output feature map, which we set to $h' = 2$ and $w' = 2$. We then obtain the output feature $\bm{A}'$ by max pooling the reduced feature map:

$$\bm{A}' = H_{max}(\bm{A}) \tag{3.5}$$

where $\bm{A}' \in \mathbb{R}^{1 \times 1 \times c}$ represents the pooling result and $H_{max}$ denotes the max pooling operation. $\bm{A}'$ then undergoes a $1 \times 1$ convolution followed by sigmoid activation, yielding a weight vector $\bm{V} \in \mathbb{R}^{1 \times 1 \times c}$ with values between 0 and 1. Finally, we perform an element-wise product between $\bm{F}_{i}^{n}$ and $\bm{V}$, and the result $\tilde{\bm{F}}_{i}^{n}$ can be expressed as:

$$\bm{V} = \mathrm{Sigmoid}(\varPhi(\bm{A}')) \tag{3.6}$$
$$\tilde{\bm{F}}_{i}^{n} = \bm{F}_{i}^{n} + (\bm{F}_{i}^{n} \otimes \bm{V}) \tag{3.7}$$

where $\otimes$ denotes the element-wise product, $\varPhi$ denotes the $1 \times 1$ convolution, and the feature map $\tilde{\bm{F}}_{i}^{n}$ represents the output feature $\tilde{\bm{F}}_{RGB}^{n}$ or $\tilde{\bm{F}}_{Dep}^{n}$ in Fig. 2. We employ a two-step pooling operation instead of conventional global average pooling. First, the input feature $\bm{F}_{i}^{n}$ passes through adaptive average pooling to obtain the intermediate feature $\bm{A}$ with a specified output size. Then, $\bm{A}$ undergoes max pooling to yield the final result $\bm{A}'$. This modification makes the network pay more attention to local regions in the image, such as objects near the background in the scene. Meanwhile, adaptive average pooling enhances the module's flexibility, accommodating diverse input feature map dimensions and retaining spatial position information in depth features; the visualization results in Fig. 5 show the module's effectiveness. The final output $\tilde{\bm{F}}_{Con}^{n}$ of the PAM in Fig. 2 is:

$$\tilde{\bm{F}}_{Con}^{n} = \tilde{\bm{F}}_{RGB}^{n} + \tilde{\bm{F}}_{Dep}^{n} \tag{3.8}$$

During the upsampling process, $\tilde{\bm{F}}_{Con}^{n}$ ($n = 1, 2, 3$) will play a role in the three-level decoder (decoder1-decoder3).
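The PAM computation in Eqs. (3.4)-(3.8) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the $1 \times 1$ convolution $\varPhi$ is modeled as a per-channel linear map with a bias (the bias is an assumption), the pooling bins follow the common floor/ceil convention for adaptive pooling, and all weights and feature values are random placeholders.

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Adaptive average pooling of an (h, w, c) map to (out_h, out_w, c)."""
    h, w, c = x.shape
    out = np.empty((out_h, out_w, c))
    for i in range(out_h):
        r0 = (i * h) // out_h                      # floor(i * h / out_h)
        r1 = ((i + 1) * h + out_h - 1) // out_h    # ceil((i + 1) * h / out_h)
        for j in range(out_w):
            c0 = (j * w) // out_w
            c1 = ((j + 1) * w + out_w - 1) // out_w
            out[i, j] = x[r0:r1, c0:c1].mean(axis=(0, 1))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pam(feat, weight, bias):
    """Pooling attention for one modality, following Eqs. (3.4)-(3.7).

    feat: (h, w, c); weight: (c, c) and bias: (c,) stand in for the
    1x1 convolution Phi applied to the 1 x 1 x c pooled vector.
    """
    a = adaptive_avg_pool(feat, 2, 2)      # A,  shape (2, 2, c)
    a_prime = a.max(axis=(0, 1))           # A', shape (c,)
    v = sigmoid(a_prime @ weight + bias)   # V,  one weight in (0, 1) per channel
    return feat + feat * v                 # F + (F (x) V), broadcast over h and w

rng = np.random.default_rng(0)
h, w, c = 6, 8, 4                          # toy sizes (assumed)
f_rgb = rng.standard_normal((h, w, c))
f_dep = rng.standard_normal((h, w, c))

# Separate (unshared) parameters for the RGB and depth branches.
w_rgb, b_rgb = rng.standard_normal((c, c)), np.zeros(c)
w_dep, b_dep = rng.standard_normal((c, c)), np.zeros(c)

f_con = pam(f_rgb, w_rgb, b_rgb) + pam(f_dep, w_dep, b_dep)   # Eq. (3.8)
print(f_con.shape)
```

Because $\bm{V}$ lies in (0, 1) and is added through a residual path, each channel of the input is scaled by a factor between 1 and 2 in this sketch, so the module re-weights channels without suppressing any of them entirely.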

3.2 Multi-modal Interaction Module

When adjacent objects in an image share similar appearances, distinguishing their categories becomes challenging. Factors such as lighting variations and object occlusion, especially in the corners, can lead to their blending with the background. This complexity makes it difficult to precisely identify object edges, leading to misclassification of the object as part of the background. Depth information remains unaffected by lighting conditions and can accurately differentiate between objects and the background based on depth values. Therefore, we designed the MIM module to supplement RGB information with Depth features. Meanwhile, it utilizes RGB features to strengthen the correlation between RGB and depth features.

Figure 4: Multi-modal Interaction Module. The RGB feature and the depth feature undergo linear transformations to generate two sets of Q, K, V (e.g., blue line) for multi-head attention, where h denotes the number of attention heads, set to 8. The weighted summation of input features $\bm{F}_{RGB}^{4}$ and $\bm{F}_{Dep}^{4}$ yields $\tilde{\bm{F}}_{RGB}^{4}$ and $\tilde{\bm{F}}_{Dep}^{4}$, which are then element-wise added to obtain the output result $\tilde{\bm{F}}_{Con}^{4}$.

The Multi-modal Interaction Module achieves dual-modality feature fusion, as depicted in Fig. 4. Here, $\bm{F}_{RGB}^{4} \in \mathbb{R}^{h \times w \times c}$ and $\bm{F}_{Dep}^{4} \in \mathbb{R}^{h \times w \times c}$ correspond to the RGB feature and depth feature from ResNetLayer4. The number of feature channels is denoted as $c$, and the spatial dimensions are $h \times w$. First, the two feature maps are linearly mapped to generate multi-head query ($\bm{Q}$), key ($\bm{K}$), and value ($\bm{V}$) vectors, where the subscripts 'rgb' and 'dep' indicate the RGB and depth features. These linear mappings are accomplished via fully connected layers, where each attention head has its own weight matrix. For each attention head, we calculate the scaled dot product between the Q of one modality and the K of the other and then normalize the results to a range between 0 and 1 using the softmax function to obtain the cross-modal attention masks $\bm{W}_{rgb}$ and $\bm{W}_{dep}$:

$$\bm{W}_{rgb} = \mathrm{Softmax}\left(\bm{Q}_{rgb}\bm{K}_{dep}^{T} / \sqrt{d_k}\right) \tag{3.9}$$
$$\bm{W}_{dep} = \mathrm{Softmax}\left(\bm{Q}_{dep}\bm{K}_{rgb}^{T} / \sqrt{d_k}\right) \tag{3.10}$$

where $\bm{W}_{rgb}$ and $\bm{W}_{dep}$ are the RGB attention mask and the depth attention mask, and $d_k$ is the dimension of the query and key vectors. We then compute the RGB weighted feature $\tilde{\bm{F}}_{RGB}$ and the depth weighted feature $\tilde{\bm{F}}_{Dep}$, and obtain the final output features $\tilde{\bm{F}}_{RGB}^{4}$ and $\tilde{\bm{F}}_{Dep}^{4}$ through a residual connection:

$$\tilde{\bm{F}}_{RGB} = \bm{W}_{rgb} \otimes \bm{V}_{rgb} \tag{3.11}$$
$$\tilde{\bm{F}}_{RGB}^{4} = \tilde{\bm{F}}_{RGB} + \bm{F}_{RGB}^{4} \tag{3.12}$$

where $\tilde{\bm{F}}_{RGB}$ is the RGB weighted feature, obtained by multiplying the value vector $\bm{V}_{rgb}$ of the RGB feature by the attention mask $\bm{W}_{rgb}$, and $\tilde{\bm{F}}_{RGB}^{4}$ is the RGB feature after fusion with depth. Likewise:

$$\tilde{\bm{F}}_{Dep} = \bm{W}_{dep} \otimes \bm{V}_{dep} \tag{3.13}$$
$$\tilde{\bm{F}}_{Dep}^{4} = \tilde{\bm{F}}_{Dep} + \bm{F}_{Dep}^{4} \tag{3.14}$$

where $\tilde{\bm{F}}_{Dep}$ is the depth weighted feature, obtained by multiplying the value vector $\bm{V}_{dep}$ of the depth feature by the attention mask $\bm{W}_{dep}$, $\tilde{\bm{F}}_{Dep}^{4}$ is the depth feature after fusion with RGB, and $\otimes$ denotes the element-wise product. Finally, the MIM output is obtained through an element-wise sum:

$$\tilde{\bm{F}}_{Con}^{4} = \tilde{\bm{F}}_{RGB}^{4} + \tilde{\bm{F}}_{Dep}^{4} \tag{3.15}$$
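
The data flow of Eqs. (3.9)-(3.15) can be sketched for a single attention head as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function name `mim_single_head` is hypothetical, the projection matrices are random stand-ins for the learned fully connected layers, the feature maps are flattened to $(h \cdot w) \times c$ token matrices, and the attention mask is applied to the values with a matrix product as in standard scaled dot-product attention (the text denotes this step with $\otimes$).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mim_single_head(f_rgb, f_dep, rng):
    """Single-head sketch of Eqs. (3.9)-(3.15).

    f_rgb, f_dep: arrays of shape (h, w, c) from the last encoder stage.
    rng: NumPy Generator used to draw stand-in projection weights
         (hypothetical; in the real network these weights are learned).
    """
    h, w, c = f_rgb.shape
    x_rgb = f_rgb.reshape(h * w, c)  # one token per spatial position
    x_dep = f_dep.reshape(h * w, c)

    def proj():
        # Stand-in for one fully connected layer producing Q, K, or V.
        return rng.standard_normal((c, c)) / np.sqrt(c)

    q_rgb, k_rgb, v_rgb = x_rgb @ proj(), x_rgb @ proj(), x_rgb @ proj()
    q_dep, k_dep, v_dep = x_dep @ proj(), x_dep @ proj(), x_dep @ proj()

    d_k = c
    w_rgb = softmax(q_rgb @ k_dep.T / np.sqrt(d_k))  # Eq. (3.9)
    w_dep = softmax(q_dep @ k_rgb.T / np.sqrt(d_k))  # Eq. (3.10)

    # Assumption: mask applied by matrix product, then residual connection.
    out_rgb = w_rgb @ v_rgb + x_rgb  # Eqs. (3.11)-(3.12)
    out_dep = w_dep @ v_dep + x_dep  # Eqs. (3.13)-(3.14)

    fused = out_rgb + out_dep  # Eq. (3.15)
    return fused.reshape(h, w, c), w_rgb, w_dep
```

Each row of `w_rgb` and `w_dep` sums to 1, so every spatial position of one modality receives a convex combination of value vectors, weighted by its similarity to the positions of the other modality.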

3.3 Loss Function

In this paper, the network is supervised at four different levels of the decoder. We use nearest-neighbor interpolation to downsample the semantic labels to the resolution of each level. A $1 \times 1$ convolution followed by a Softmax function computes the per-pixel classification probabilities from the output features of each of the four upsampling layers. The loss $\mathcal{L}_{i}$ of layer $i$ is the pixel-level cross-entropy loss:

$$\mathcal{L}_{i} = -\frac{1}{N_{i}} \sum_{\forall p,q} Y(p,q) \log\left(Y'(p,q)\right) \tag{3.16}$$

where $N_{i}$ denotes the number of pixels in layer $i$, $(p,q)$ is the pixel position, $Y'$ is the predicted classification probability, and $Y$ is the ground-truth label. The final loss $\mathcal{L}$ of the network is the sum of the pixel-level losses of the four decoding layers:

$$\mathcal{L} = \sum_{i=1}^{4} \mathcal{L}_{i} \tag{3.17}$$

By minimizing this loss, the whole network is trained end to end in a single training stage, after which it produces the final segmentation result.
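
The deep supervision of Eqs. (3.16)-(3.17) can be sketched in NumPy as below. The function names are hypothetical, labels are given as integer class indices rather than one-hot maps, and the label maps are shrunk by strided slicing, which is one simple form of nearest-neighbor downsampling and assumes integer scale factors; the actual training code may differ.

```python
import numpy as np


def pixel_cross_entropy(probs, labels):
    """Eq. (3.16): mean cross-entropy over all pixels of one decoder level.

    probs:  (H, W, C) softmax probabilities Y'.
    labels: (H, W) integer class indices (the one-hot Y is implied).
    """
    h, w, _ = probs.shape
    rows, cols = np.indices((h, w))
    picked = probs[rows, cols, labels]      # probability of the true class
    return -np.log(picked + 1e-12).mean()   # 1e-12 guards against log(0)


def deep_supervision_loss(probs_per_level, full_res_labels):
    """Eq. (3.17): sum of the per-level losses.

    Labels are shrunk by strided slicing, a simple form of
    nearest-neighbor downsampling (assumes integer scale factors).
    """
    total = 0.0
    for probs in probs_per_level:
        step = full_res_labels.shape[0] // probs.shape[0]
        labels = full_res_labels[::step, ::step]
        total += pixel_cross_entropy(probs, labels)
    return total
```

With uniform predictions over $C$ classes, each level contributes $\log C$, so the total is $4 \log C$; this gives a quick sanity check for an implementation.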

4 Experimental results and analysis

4.1 Datasets and Evaluation Measures

The NYU-Depth V2 dataset [37] is a widely used indoor scene understanding dataset for computer vision and deep learning research. It aggregates video sequences of various indoor scenes recorded by the RGB-D cameras of the Microsoft Kinect and is an updated version of the NYU-Depth dataset published by Nathan Silberman and Rob Fergus in 2011. It contains 1449 RGB-D images with aligned depth maps and semantic labels of indoor environments. The dataset covers different indoor scenes and scene types and also includes unlabeled frames; each object is identified by a class and an instance number.

The SUN RGB-D dataset [38] contains image samples from a variety of indoor scenes such as offices, bedrooms, and living rooms. It has 37 categories and contains 10335 RGB-D images with pixel-level annotations, of which 5285 are used for training and 5050 for testing. The dataset is captured by four different sensors: Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2. This densely annotated dataset also includes 146,617 2D polygons, 64,595 3D bounding boxes with accurate object orientations, a 3D room layout, and an image-based scene category.

We evaluate the results using two standard metrics, Pixel Accuracy (Pixel Acc) and Mean Intersection over Union (mIoU).

mIoU: Intersection over Union (IoU) measures segmentation quality per class as the ratio between the intersection and the union of the ground-truth and predicted pixel sets of that class. mIoU is the average IoU over all classes in the dataset.

$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \tag{4.1}$$

where $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, $p_{ji}$ is the number of pixels of class $j$ predicted as class $i$, and $p_{ii}$ is the number of correctly predicted pixels of class $i$. The classes are indexed from $0$ to $k$, so $k+1$ classes are averaged.

Acc: Pixel accuracy (PA) is the simplest metric. It is the proportion of correctly labelled pixels among all pixels.

$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \tag{4.2}$$

where $p_{ii}$ is the number of correctly predicted pixels of class $i$, $p_{ij}$ is the number of pixels of class $i$ predicted as class $j$, and the classes are indexed from $0$ to $k$.
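
Both metrics follow directly from a confusion matrix whose entry $p_{ij}$ counts the pixels of class $i$ predicted as class $j$. The following NumPy sketch illustrates Eqs. (4.1) and (4.2); the function names are illustrative, and the mIoU function assumes that every class occurs at least once in the labels or predictions, so that no union is zero.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """p[i, j] = number of pixels of class i predicted as class j."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def pixel_accuracy(p):
    """Eq. (4.2): correctly labelled pixels over all pixels."""
    return float(np.trace(p) / p.sum())


def mean_iou(p):
    """Eq. (4.1): per-class IoU averaged over all classes.

    Assumes no class has an empty union (otherwise this divides by zero).
    """
    tp = np.diag(p)
    union = p.sum(axis=1) + p.sum(axis=0) - tp
    return float(np.mean(tp / union))
```

For example, with ground truth `[0, 0, 1, 1, 2, 2]` and predictions `[0, 1, 1, 1, 2, 0]`, four of six pixels are correct (PA = 2/3), and the per-class IoUs are 1/3, 2/3, and 1/2, which gives an mIoU of 0.5.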

4.2 Implementation Details

We implemented and trained our proposed network using the PyTorch framework. To increase the diversity of the training data, we applied random scaling and mirroring. All RGB and depth images were then resized to $480 \times 480$ as network inputs, and the semantic labels were resized to $480 \times 480$, $240 \times 240$, $120 \times 120$, and $60 \times 60$ for deep supervision training. As the backbone of our encoder, we used a ResNet50 [39] pre-trained on the ImageNet classification dataset [40]. Following [41, 42, 43], we replaced the $7 \times 7$ convolution in the input stem with three consecutive $3 \times 3$ convolutions. Training was conducted on an NVIDIA GeForce RTX 3090 GPU using stochastic gradient descent, with a batch size of 6, an initial learning rate of 0.003, 500 epochs, a momentum of 0.9, and a weight decay of 0.0005.

4.3 Quantitative Results on NYUv2 and SUN RGB-D

Firstly, we compare the proposed method against existing approaches using the NYUv2 dataset.

Table 1: MIPANet compared to the state-of-the-art methods on the NYUDv2 dataset.
| Model | Method | Backbone | mIoU(%) | Pix.Acc(%) |
|---|---|---|---|---|
| ResNet34 | IEMNet[44] | Res34NBt1D | 51.3 | 76.8 |
| ResNet18 | ESANet[16] | 2 × R18 | 48.2 | - |
| ResNet50 | RDFNet[7] | 2 × R50 | 47.7 | 74.8 |
| ResNet50 | ACNet[15] | 3 × R50 | 48.3 | - |
| ResNet50 | SA-Gate[45] | 2 × R50 | 50.4 | - |
| ResNet50 | ESANet | 2 × R50 | 50.5 | - |
| ResNet50 | DynMM[46] | R50 | 51.0 | - |
| ResNet50 | RedNet[8] | 2 × R50 | 47.2 | - |
| ResNet101 | SGNet[47] | R101 | 49.6 | 75.6 |
| ResNet101 | RDFNet | 2 × R101 | 49.1 | 75.6 |
| ResNet101 | ShapeConv[48] | R101 | 51.3 | 76.4 |
| ResNet50 | Baseline | 2 × R50 | 47.4 | 75.1 |
| ResNet50 | Ours(MIPA) | 2 × R50 | 51.9 (+4.5%) | 77.2 (+2.1%) |

Table 2: MIPANet compared to the state-of-the-art methods on the SUN RGB-D dataset.
| Model | Method | Backbone | mIoU(%) | Pix.Acc(%) |
|---|---|---|---|---|
| ResNet34 | EMSANet[49] | 2 × R34 | 48.5 | - |
| ResNet34 | IEMNet[44] | Res34NBt1D | 48.3 | 81.9 |
| ResNet50 | ACNet[15] | 3 × R50 | 48.1 | - |
| ResNet50 | ESANet[16] | 2 × R50 | 48.3 | - |
| ResNet50 | RedNet[8] | 2 × R50 | 47.8 | 81.3 |
| ResNet101 | SGNet[47] | R101 | 47.1 | 81.0 |
| ResNet101 | CANet[50] | R101 | 48.3 | 82.0 |
| ResNet101 | CGBNet[51] | R101 | 48.2 | 82.3 |
| ResNet101 | ShapeConv[48] | R101 | 48.6 | 82.2 |
| ResNet152 | RDFNet[7] | 2 × R152 | 47.7 | 81.5 |
| ResNet50 | Baseline | 2 × R50 | 45.5 | 81.1 |
| ResNet50 | Ours(MIPA) | 2 × R50 | 48.8 (+3.3%) | 82.3 (+1.2%) |

Table 1 shows that our method outperforms the other methods in both mIoU and pixel accuracy (Acc). With ResNet50 as the encoder, MIPANet reaches a pixel accuracy of 77.2% and an mIoU of 51.9% on the NYUv2 test set. Compared with RDFNet, which also uses ResNet50, our approach improves Acc by 2.4% (74.8% to 77.2%) and mIoU by 4.2% (47.7% to 51.9%). This is a clear gain in segmentation accuracy with the same ResNet50 architecture. Compared to SGNet, which uses ResNet101, our model improves Acc and mIoU by 1.6% and 2.3%, respectively. Notably, our ResNet50-based network outperforms the ResNet101-based methods, which demonstrates the effectiveness of our network structure and the multi-modal feature fusion module. These improvements are achieved without a deeper backbone, which keeps training time lower. In the tables, "R" stands for ResNet, and the symbol '-' indicates that the corresponding method did not report pixel accuracy. The backbone column also records structural differences between methods; for example, one ESANet variant uses two ResNet18 networks as the backbone, while ACNet uses three ResNet50 networks.

We then compared our proposed algorithm with existing methods on the SUN RGB-D dataset. As shown in Table 2, our approach achieves the highest mIoU among the compared methods. For instance, MIPANet outperforms SGNet by 1.3% in Acc and 1.7% in mIoU. This indicates that our modules maintain their advantage on the larger SUN RGB-D dataset. Across backbone architectures, ResNet101 generally performs better than ResNet50, which in turn outperforms ResNet18. We chose ResNet50 as our backbone to obtain competitive performance with less training time than ResNet101. Compared to the baseline, our method improves mIoU and Acc by 4.5% and 2.1% on NYUv2 and by 3.3% and 1.2% on SUN RGB-D, as indicated in parentheses in the tables.

4.4 Visualization results on NYUv2

To visually highlight the advancements made by our method in the realm of RGB-D semantic segmentation, we provide visualization results of the network on the NYUv2 dataset.

Figure 5: Visual results of MIPANet on the NYUv2 dataset. The improvement is particularly notable within the red dotted boxes.

Fig. 5 shows the visualization results of the proposed algorithm on the NYUv2 dataset. From left to right, the columns depict the RGB image, the depth image, the results of the baseline model with a ResNet50 backbone, ACNet, ESANet, MIPANet (ours), and the ground truth. Compared to the baseline, our method yields clearly better segmentation results. The dashed boxes highlight regions where our network, enriched with depth information, separates objects from the background more accurately. For instance, in the fourth image the baseline erroneously labels the mirror on the wall as background, and in the second image ACNet and ESANet mistake the carpet for part of the floor. In contrast, our network uses depth information to distinguish the mirror from the background and classifies it correctly. Overall, the proposed algorithm achieves precise segmentation in diverse and intricate indoor scenes, handles challenging objects such as "carpets" and "books", and produces finer object edges.

4.5 Ablation Study on PAM and MIM on NYUv2

We conducted ablation experiments comparing different PAM and MIM configurations on the NYUv2 dataset, as shown in Fig. 6.

Figure 6: Ablation study on PAM and MIM. Configuration B1 gives the best segmentation result, an mIoU of 51.9%.

Specifically, the RGB and depth features are fed into PAM to obtain $\tilde{\bm{F}}_{RGB}^{n}$ and $\tilde{\bm{F}}_{Dep}^{n}$. Given the differences between the modalities, we examined whether the PAMs of the two branches should share parameters. Moreover, considering the impact of network depth on information interaction, we compared applying MIM only in Layer 4 with applying it in both Layer 3 and Layer 4 of the encoder. Fig. 6 presents the results of the ablation studies for four configurations (B1-B4) on the NYUv2 dataset: B1 (PAM without shared parameters and MIM used on ResNetLayer4), B2 (PAM with shared parameters and MIM used on ResNetLayer4), B3 (PAM without shared parameters and MIM used on ResNetLayer3 and ResNetLayer4), and B4 (PAM with shared parameters and MIM used on ResNetLayer3 and ResNetLayer4). The best result is obtained with B1, that is, PAM without shared parameters in ResNetLayer1-3 and MIM only in the last layer of the encoder, which yields the highest mIoU of 51.9%.
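
The four configurations form a 2 × 2 grid over two design choices, which can be written down compactly. The sketch below is only a bookkeeping aid; the class and field names are illustrative and do not come from the authors' code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AblationConfig:
    """One PAM/MIM variant from the ablation (field names are illustrative)."""
    name: str
    share_pam_params: bool       # do the RGB and depth PAMs share weights?
    mim_layers: Tuple[int, ...]  # ResNet layers where MIM is applied


CONFIGS = [
    AblationConfig("B1", share_pam_params=False, mim_layers=(4,)),
    AblationConfig("B2", share_pam_params=True, mim_layers=(4,)),
    AblationConfig("B3", share_pam_params=False, mim_layers=(3, 4)),
    AblationConfig("B4", share_pam_params=True, mim_layers=(3, 4)),
]

for cfg in CONFIGS:
    sharing = "shared" if cfg.share_pam_params else "separate"
    print(f"{cfg.name}: {sharing} PAM parameters, MIM on layers {cfg.mim_layers}")
```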

4.6 Ablation Study on NYUv2 and SUN-RGBD

To investigate the impact of the individual modules on segmentation performance, we conducted ablation experiments on the NYUv2 and SUN-RGBD datasets, as shown in Table 3. '✓' indicates that a module is used, while '✕' indicates that it is not. On NYUv2, PAM alone improves mIoU and Acc over the baseline by 1.5% and 0.9%, and MIM alone improves them by 3.7% and 1.9%. These results suggest that each proposed module can independently enhance segmentation accuracy. Our modules fuse cross-modal features better than the baseline and yield superior results on both datasets. Using both PAM and MIM, we achieved the highest mIoU of 51.9% on the NYUv2 dataset and the highest mIoU of 48.8% on the SUN RGB-D dataset, which shows that the two modules complement each other when optimized jointly.

Table 3: Ablation studies on the NYUv2 and SUN-RGBD datasets for PAM and MIM.
| Method | PAM | MIM | NYUv2 mIoU(%) | NYUv2 Acc(%) | SUN-RGBD mIoU(%) | SUN-RGBD Acc(%) |
|---|---|---|---|---|---|---|
| Baseline | ✕ | ✕ | 47.4 | 75.1 | 45.5 | 81.1 |
| Ours | ✓ | ✕ | 48.9 | 76.0 | 47.9 | 81.3 |
| Ours | ✕ | ✓ | 51.1 | 77.0 | 48.3 | 81.5 |
| Ours | ✓ | ✓ | 51.9 | 77.2 | 48.8 | 82.3 |
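
The gains attributed to each module can be recomputed from the entries of Table 3. The short sketch below does this in percentage points; the dictionary layout and the function name are illustrative, and the numbers are copied from the table.

```python
# (mIoU, Acc) in percent, copied from Table 3.
RESULTS = {
    "NYUv2": {
        "baseline": (47.4, 75.1),
        "PAM": (48.9, 76.0),
        "MIM": (51.1, 77.0),
        "PAM+MIM": (51.9, 77.2),
    },
    "SUN-RGBD": {
        "baseline": (45.5, 81.1),
        "PAM": (47.9, 81.3),
        "MIM": (48.3, 81.5),
        "PAM+MIM": (48.8, 82.3),
    },
}


def gains_over_baseline(dataset):
    """Return {variant: (mIoU gain, Acc gain)} in percentage points."""
    base_miou, base_acc = RESULTS[dataset]["baseline"]
    return {
        name: (round(miou - base_miou, 1), round(acc - base_acc, 1))
        for name, (miou, acc) in RESULTS[dataset].items()
        if name != "baseline"
    }


for dataset in RESULTS:
    print(dataset, gains_over_baseline(dataset))
```

On NYUv2 this gives gains of (1.5, 0.9) for PAM, (3.7, 1.9) for MIM, and (4.5, 2.1) for both modules together; on SUN-RGBD the corresponding gains are (2.4, 0.2), (2.8, 0.4), and (3.3, 1.2).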

5 Conclusions

In this paper, we tackle a fundamental challenge in RGB-D semantic segmentation: efficiently fusing features from two distinct modalities. We designed a Multi-modal Interaction and Pooling Attention network, which uses a small and flexible PAM in the shallow layers of the network to enhance its feature extraction capability and a MIM in the last layer of the encoder to integrate RGB and depth features effectively. The network exploits the complementary information between the RGB and depth modalities to improve the accuracy of semantic segmentation in indoor scenes. In future work, we will extend our method to improve its generalization ability in RGB-D semantic segmentation. Furthermore, we anticipate performance improvements from integrating tasks such as depth estimation into the existing framework, which would allow the tasks to interact and support each other.

Limitations. The effectiveness of our method has been validated only on CNN backbones; we have not yet verified it on other network architectures, such as Transformers. In addition, the network requires both RGB and depth images as input at test time, which limits its generalization ability. Consequently, the network may not achieve optimal segmentation results on datasets that lack depth information.


Conflict of interest

The authors declare there is no conflict of interest.

References

  • [1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [2] M. Li, M. Wei, X. He, F. Shen, Enhancing part features via contrastive attention module for vehicle re-identification, in: 2022 IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 1816–1820.
  • [3] Z. Zhang, Microsoft kinect sensor and its effect, IEEE multimedia 19 (2) (2012) 4–10.
  • [4] Y. He, W.-C. Chiu, M. Keuper, M. Fritz, Std2p: Rgbd semantic segmentation using spatio-temporal data-driven pooling, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7158–7167.
  • [5] C. Couprie, C. Farabet, L. Najman, Y. LeCun, Indoor semantic segmentation using depth information, arXiv preprint arXiv:1301.3572 (2013).
  • [6] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from rgb-d images for object detection and segmentation, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13, Springer, 2014, pp. 345–360.
  • [7] S.-J. Park, K.-S. Hong, S. Lee, Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 4990–4999.
  • [8] J. Jiang, L. Zheng, F. Luo, Z. Zhang, Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation, arXiv preprint arXiv:1806.01054 (2018).
  • [9] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2650–2658.
  • [10] A. Wang, J. Lu, G. Wang, J. Cai, T.-J. Cham, Multi-modal unsupervised feature learning for rgb-d scene labeling, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 453–467.
  • [11] F. Liu, C. Shen, G. Lin, Deep convolutional neural fields for depth estimation from a single image, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5162–5170.
  • [12] J. Hu, Z. Huang, F. Shen, D. He, Q. Xian, A bag of tricks for fine-grained roof extraction, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023.
  • [13] J. Hu, Z. Huang, F. Shen, D. He, Q. Xian, A rubust method for roof extraction and height estimation, in: IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2023.
  • [14] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture, in: Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, Springer, 2017, pp. 213–228.
  • [15] X. Hu, K. Yang, L. Fei, K. Wang, Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 1440–1444.
  • [16] D. Seichter, M. Köhler, B. Lewandowski, T. Wengefeld, H.-M. Gross, Efficient rgb-d semantic segmentation for indoor scene analysis, in: 2021 IEEE international conference on robotics and automation (ICRA), IEEE, 2021, pp. 13525–13531.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
  • [18] F. Shen, M. Wei, J. Ren, Hsgnet: Object re-identification with hierarchical similarity graph network, arXiv preprint arXiv:2211.05486 (2022).
  • [19] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
  • [20] F. Shen, J. Zhu, X. Zhu, J. Huang, H. Zeng, Z. Lei, C. Cai, An efficient multiresolution network for vehicle reidentification, IEEE Internet of Things Journal 9 (11) (2022) 9049–9059.
  • [21] F. Shen, X. Peng, L. Wang, X. Zhang, M. Shu, Y. Wang, Hsgm: A hierarchical similarity graph module for object re-identification, in: 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022, pp. 1–6.
  • [22] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
  • [23] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, X. Tang, Residual attention network for image classification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3156–3164.
  • [24] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [25] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, Eca-net: Efficient channel attention for deep convolutional neural networks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11531–11539.
  • [26] C. Qiao, F. Shen, X. Wang, R. Wang, F. Cao, S. Zhao, C. Li, A novel multi-frequency coordinated module for sar ship detection, in: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2022, pp. 804–811.
  • [27] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, W. Liu, Ccnet: Criss-cross attention for semantic segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 603–612.
  • [28] Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, T. Harada, Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 5108–5115.
  • [29] F. Shen, X. Du, L. Zhang, J. Tang, Triplet contrastive learning for unsupervised vehicle re-identification, arXiv preprint arXiv:2301.09498 (2023).
  • [30] Q. Zhang, S. Zhao, Y. Luo, D. Zhang, N. Huang, J. Han, Abmdrnet: Adaptive-weighted bi-directional modality difference reduction network for rgb-t semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2633–2642.
  • [31] F. Shen, X. Shu, X. Du, J. Tang, Pedestrian-specific bipartite-aware similarity learning for text-based person retrieval, in: Proceedings of the 31th ACM International Conference on Multimedia, 2023.
  • [32] Y. Sun, W. Zuo, M. Liu, Rtfnet: Rgb-thermal fusion network for semantic segmentation of urban scenes, IEEE Robotics and Automation Letters 4 (3) (2019) 2576–2583.
  • [33] K. Xiang, K. Yang, K. Wang, Polarization-driven semantic segmentation via efficient attention-bridged fusion, Optics Express 29 (4) (2021) 4802–4820.
  • [34] F. Shen, J. Zhu, X. Zhu, Y. Xie, J. Huang, Exploring spatial significance via hybrid pyramidal graph network for vehicle re-identification, IEEE Transactions on Intelligent Transportation Systems 23 (7) (2022) 8793–8804.
  • [35] F. Shen, Y. Xie, J. Zhu, X. Zhu, H. Zeng, Git: Graph interactive transformer for vehicle re-identification, IEEE Transactions on Image Processing (2023).
  • [36] Z. Zhuang, R. Li, K. Jia, Q. Wang, Y. Li, M. Tan, Perception-aware multi-sensor fusion for 3d lidar semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16260–16270.
  • [37] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, in: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, Springer, 2012, pp. 746–760.
  • [38] S. Song, S. P. Lichtenberg, J. Xiao, Sun rgb-d: A rgb-d scene understanding benchmark suite, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 567–576.
  • [39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
  • [41] X. Fu, F. Shen, X. Du, Z. Li, Bag of tricks for “vision meet alage” object detection challenge, in: 2022 6th International Conference on Universal Village (UV), IEEE, 2022, pp. 1–4.
  • [42] F. Shen, X. He, M. Wei, Y. Xie, A competitive method to vipriors object detection challenge, arXiv preprint arXiv:2104.09059 (2021).
  • [43] F. Shen, Z. Wang, Z. Wang, X. Fu, J. Chen, X. Du, J. Tang, A competitive method for dog nose-print re-identification, arXiv preprint arXiv:2205.15934 (2022).
  • [44] X. Xu, J. Liu, H. Liu, Interactive efficient multi-task network for rgb-d semantic segmentation, Electronics 12 (18) (2023) 3943.
  • [45] X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, G. Zeng, Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation, in: European Conference on Computer Vision, Springer, 2020, pp. 561–577.
  • [46] Z. Xue, R. Marculescu, Dynamic multimodal fusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2575–2584.
  • [47] L.-Z. Chen, Z. Lin, Z. Wang, Y.-L. Yang, M.-M. Cheng, Spatial information guided convolution for real-time rgbd semantic segmentation, IEEE Transactions on Image Processing 30 (2021) 2313–2324.
  • [48] J. Cao, H. Leng, D. Lischinski, D. Cohen-Or, C. Tu, Y. Li, Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7068–7077.
  • [49] D. Seichter, S. B. Fischedick, M. Köhler, H.-M. Groß, Efficient multi-task rgb-d scene analysis for indoor environments, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–10.
  • [50] Q. Tang, F. Liu, T. Zhang, J. Jiang, Y. Zhang, Attention-guided chained context aggregation for semantic segmentation, Image and Vision Computing 115 (2021) 104309.
  • [51] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, G. Wang, Semantic segmentation with context encoding and multi-path decoding, IEEE Transactions on Image Processing 29 (2020) 3520–3533.