Joint CNN and Vision Transformer for Indoor Scene Recognition
7 Pages | Posted: 6 Mar 2025
Abstract
Indoor scene recognition is a growing field with great potential in smart homes, robot navigation, and other applications. Although Convolutional Neural Networks (CNNs) are effective at extracting low-level features and modeling local relationships, their indoor scene recognition accuracy is limited by their inability to establish long-range dependencies, whereas the Vision Transformer can establish such dependencies. Motivated by this, we propose a Joint Convolutional Neural Network and Vision Transformer (JCVT) method, which combines the advantages of both. First, we design a Local Enhancement Vision Transformer Module (LEVTM) to enhance the performance of the Agent-CSWin Transformer by capturing rich local features. Second, to exploit the semantic information contained in indoor scenes, we construct a Semantic Enhancement Convolutional Neural Network Module (SECNNM), which employs ResNet50 (with the final classification layer removed) as an encoder and convolutional layers as a decoder. Third, to fully leverage the complementary strengths of CNNs and the Vision Transformer, we integrate the LEVTM and SECNNM, together with the original ResNet50 model, to generate the final indoor scene representation. Extensive experiments on three benchmark indoor scene datasets demonstrate the superiority of the proposed method over state-of-the-art approaches.
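To make the three-branch design described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a standard ViT-B/16 stands in for the Agent-CSWin-based LEVTM, a ResNet50 encoder with a small convolutional decoder stands in for SECNNM, and a plain ResNet50 supplies the third branch. The class name JointCNNViT, the fusion by feature concatenation, the decoder layout, and the 67-class output (e.g., for the MIT-67 indoor dataset) are illustrative assumptions rather than details from the paper.

# Hedged sketch of the three-branch fusion described in the abstract.
# Assumed details: vit_b_16 replaces the Agent-CSWin-based LEVTM, the SECNNM
# decoder is a single conv block, and branch features are fused by concatenation.
import torch
import torch.nn as nn
from torchvision.models import resnet50, vit_b_16


class JointCNNViT(nn.Module):
    def __init__(self, num_classes: int = 67):
        super().__init__()
        # Branch 1: ViT stand-in for the Local Enhancement Vision Transformer Module.
        vit = vit_b_16(weights=None)
        vit.heads = nn.Identity()          # drop the classification head -> 768-d features
        self.levtm = vit

        # Branch 2: SECNNM stand-in: ResNet50 encoder (classifier removed)
        # followed by a small convolutional decoder.
        encoder = resnet50(weights=None)
        self.secnnm_encoder = nn.Sequential(*list(encoder.children())[:-2])  # (B, 2048, 7, 7)
        self.secnnm_decoder = nn.Sequential(
            nn.Conv2d(2048, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),       # pool to a 512-d vector
        )

        # Branch 3: the original ResNet50 as a global CNN feature extractor.
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()        # 2048-d features
        self.resnet = backbone

        # Fuse the three branch representations and classify.
        self.classifier = nn.Linear(768 + 512 + 2048, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_vit = self.levtm(x)                                            # (B, 768)
        f_sem = self.secnnm_decoder(self.secnnm_encoder(x)).flatten(1)  # (B, 512)
        f_cnn = self.resnet(x)                                           # (B, 2048)
        return self.classifier(torch.cat([f_vit, f_sem, f_cnn], dim=1))


if __name__ == "__main__":
    model = JointCNNViT(num_classes=67)          # 67 is a placeholder class count
    logits = model(torch.randn(2, 3, 224, 224))  # ViT-B/16 expects 224x224 inputs
    print(logits.shape)                          # torch.Size([2, 67])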
Keywords: Convolutional neural networks, vision transformer, semantic information, indoor scene recognition.