HAVT: Hierarchical Attention Vision Transformer for Fine-Grained Visual Classification
30 Pages · Posted: 10 Jun 2022
Abstract
Recently, the Vision Transformer has achieved a breakthrough in image recognition: its self-attention mechanism generates attention weights that extract discriminative token information from each pixel patch and connect it to the class token, making it well suited to fine-grained image classification. Nevertheless, the class token in the deep layers tends to ignore local features between layers. In addition, the embedding layer feeds fixed-size patches into the network, inevitably introducing additional image noise. We therefore propose HAVT, a Hierarchical Attention Vision Transformer built on the Transformer framework. An attention-guided data augmentation module uses attention weights as a guide, reducing the impact of noise from fixed-size patches. A Hierarchical Attention Selection module is then proposed to filter and fuse the tokens between hierarchies, effectively guiding the network to select discriminative tokens across layers. The effectiveness of HAVT is validated on two popular fine-grained datasets.
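The core idea of selecting discriminative tokens by their class-token attention can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the top-k selection rule, and the simple concatenation-based fusion are assumptions for clarity; the actual HAVT modules may differ.

```python
import torch

def select_tokens(tokens, attn, k=12):
    """Keep the k patch tokens with the highest class-token attention.

    tokens: (B, N+1, D) layer output, where index 0 is the class token.
    attn:   (B, H, N+1, N+1) attention weights from the same layer.
    k:      number of tokens to keep (a hypothetical hyperparameter).
    """
    # Take the class-token row of the attention map, averaged over heads,
    # and drop the class token's attention to itself (column 0).
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)            # (B, N)
    idx = cls_attn.topk(k, dim=1).indices               # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    # Gather the selected patch tokens (skipping the class token at index 0).
    return torch.gather(tokens[:, 1:], 1, idx)          # (B, k, D)

def fuse_hierarchy(cls_token, selected_per_layer):
    """Fuse the class token with tokens selected at several hierarchy levels.

    A simple concatenation stands in for the paper's fusion step; a final
    encoder block would then attend over this fused sequence.
    """
    return torch.cat([cls_token] + selected_per_layer, dim=1)
```

In this sketch, tokens kept at each layer carry layer-local discriminative cues forward, so the final classifier is not limited to the deep-layer class token alone.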
Keywords: Fine-Grained visual classification, Vision transformer, Hierarchical attention selection, Attention-guided data augmentation