Self-Supervised Transformer for Infrared and Visible Image Fusion
22 Pages Posted: 17 Nov 2022
Abstract
Existing infrared and visible image fusion methods usually rely on hand-designed or simple convolution-based fusion strategies, which cannot explicitly model the contextual relationships between infrared and visible images. To this end, in this paper we propose a Transformer-based feature fusion network that models the contextual relationship between the two modalities for robust image fusion. Specifically, our fusion network consists of a detail self-attention module that captures the detail information of each modality and a saliency cross-attention module that models the contextual relationships between the two modalities. Since these two attention modules capture pixel-level global dependencies, the fusion network has a powerful detail representation ability, which is critical for the pixel-level image generation task. Moreover, to address the slight misalignment between source image pairs, we propose a deformable-convolution-based feature alignment network that helps reduce artifacts. Because the infrared and visible image fusion task has no ground truth, we design a self-supervised multi-task loss, consisting of a structural similarity loss, an intensity loss, and a gradient loss, to train the proposed method end-to-end. Extensive experiments on four benchmarks demonstrate that the proposed method achieves competitive performance compared with state-of-the-art methods.
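The abstract names the three terms of the self-supervised loss (structural similarity, intensity, gradient) but not their exact form. The PyTorch sketch below shows one plausible combination for illustration only; the weights alpha and beta, the uniform-window SSIM approximation, and the max-based intensity and gradient targets are assumptions, not the authors' formulation.

    import torch
    import torch.nn.functional as F

    def ssim_loss(fused, ref, C1=0.01**2, C2=0.03**2, win=11):
        # Structural similarity with a uniform window (a simplification of the
        # usual Gaussian-window SSIM); inputs assumed to be in [0, 1].
        mu_x = F.avg_pool2d(fused, win, 1, win // 2)
        mu_y = F.avg_pool2d(ref, win, 1, win // 2)
        sigma_x = F.avg_pool2d(fused * fused, win, 1, win // 2) - mu_x ** 2
        sigma_y = F.avg_pool2d(ref * ref, win, 1, win // 2) - mu_y ** 2
        sigma_xy = F.avg_pool2d(fused * ref, win, 1, win // 2) - mu_x * mu_y
        ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
        return 1 - ssim.mean()

    def gradients(img):
        # Horizontal and vertical finite differences as a simple gradient operator.
        gx = img[..., :, 1:] - img[..., :, :-1]
        gy = img[..., 1:, :] - img[..., :-1, :]
        return gx, gy

    def fusion_loss(fused, ir, vis, alpha=1.0, beta=10.0):
        # Structural similarity term measured against both source modalities.
        l_ssim = ssim_loss(fused, ir) + ssim_loss(fused, vis)
        # Intensity term: keep the fused image close to the brighter source pixel
        # (an assumed target; other aggregations, e.g. a mean, are also common).
        l_int = F.l1_loss(fused, torch.max(ir, vis))
        # Gradient term: preserve the strongest edges from either modality.
        gx_f, gy_f = gradients(fused)
        gx_ir, gy_ir = gradients(ir)
        gx_v, gy_v = gradients(vis)
        l_grad = F.l1_loss(gx_f.abs(), torch.max(gx_ir.abs(), gx_v.abs())) + \
                 F.l1_loss(gy_f.abs(), torch.max(gy_ir.abs(), gy_v.abs()))
        return l_ssim + alpha * l_int + beta * l_grad

All three terms are computed only from the source pair and the fused output, which is what makes the training signal self-supervised: no fused ground-truth image is required.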
Keywords: Self-supervised, Transformer, Image fusion, Deformable convolution