Vision Foundation Model Guided Multi-Modal Fusion Network for Remote Sensing Semantic Segmentation

44 Pages | Posted: 25 Jun 2024

Chen Pan

Nanjing Forestry University

Xijian Fan

Nanjing Forestry University

Tardi Tjahjadi

University of Warwick

Haiyan Guan

Nanjing University of Information Science and Technology

Qiaolin Ye

Nanjing Forestry University

Liyong Fu

affiliation not provided to SSRN

Ruili Wang

Massey University

Abstract

With the rapid development of Earth observation sensors, the fusion of multi-modal remote sensing (RS) data for semantic segmentation has attracted significant research attention in recent years. Fusing multi-modal data is challenging because different sensors acquire images through different mechanisms, which leads to misalignment between modalities. To mitigate this challenge, this paper presents VSGNet, a novel multi-modal fusion framework for RS semantic segmentation. The work utilises vision structure guidance derived from a vision foundation model to achieve accurate segmentation without the need for auxiliary sensors. Specifically, the framework incorporates a cross-modal collaborative network for feature embedding that blends a convolutional neural network and a vision transformer to simultaneously capture local information and long-range dependencies from the input modalities. Subsequently, a multi-scale cross-modal feature fusion module, comprising fusion enhancement and feature re-calibration components, is proposed to promote adaptive multi-scale interaction of the complementary cues between modalities while suppressing the impact of noise and uncertainty present in RS data. Extensive experiments on five diverse RS datasets, i.e., ISPRS Potsdam, ISPRS Vaihingen, LoveDA, iSAID and Tree Mapping, demonstrate that VSGNet outperforms state-of-the-art RS semantic segmentation models. The source code for VSGNet and the Tree Mapping dataset will be made publicly available at https://github.com/Pcccc1/VSGNet.

Keywords: Semantic Segmentation, Cross-Modal Fusion, Vision Foundation Model, Land Cover Mapping

Suggested Citation

Pan, Chen and Fan, Xijian and Tjahjadi, Tardi and Guan, Haiyan and Ye, Qiaolin and Fu, Liyong and Wang, Ruili, Vision Foundation Model Guided Multi-Modal Fusion Network for Remote Sensing Semantic Segmentation. Available at SSRN: https://ssrn.com/abstract=4876040 or http://dx.doi.org/10.2139/ssrn.4876040

Chen Pan

Nanjing Forestry University

159 Longpan Rd
Nanjing, 210037
China

Xijian Fan (Contact Author)

Nanjing Forestry University

159 Longpan Rd
Nanjing, 210037
China

Tardi Tjahjadi

University of Warwick

Gibbet Hill Rd.
Coventry, CV4 8UW
United Kingdom

Haiyan Guan

Nanjing University of Information Science and Technology

Nanjing
China

Qiaolin Ye

Nanjing Forestry University

159 Longpan Rd
Nanjing, 210037
China

Liyong Fu

affiliation not provided to SSRN

No Address Available

Ruili Wang

Massey University
