Neighbor Patches Merging Reduces Spatial Redundancy of Nature Images
29 Pages Posted: 13 Dec 2023
Abstract
The introduction of the Transformer architecture in computer vision has unified the processing of image and text data. However, the computational cost of Transformer networks grows quadratically with the sequence length. To mitigate this, the Vision Transformer (ViT) divides images into patches and embeds them as tokens for network input, thereby reducing the sequence length. This study leverages the spatial redundancy in natural images and incorporates adaptive patch sizes within images. We propose the Neighbor Patch Merging (NEPAM) method, which merges image patches at the very start of the network. NEPAM effectively reduces the sequence length and accelerates inference without requiring alterations to the network. Furthermore, we observe that merging patches leads to the loss of position embeddings and of accuracy. To address this, we propose Multi-Scale Relative Position Embeddings (MS-RPE) to model the positional relationship between patches of adaptive sizes. Both NEPAM and MS-RPE can be seamlessly integrated into the network, enabling more flexible model deployment. Experiments demonstrate that applying NEPAM and MS-RPE to DeiT-Small models results in a 2.26x speedup with an accuracy loss of 2.44%, without the necessity of retraining for a fixed pruning rate.
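To make the idea of merging neighboring patches before the network concrete, the following is a minimal sketch and not the NEPAM algorithm itself. It assumes a simple, hypothetical merging rule: each 2x2 group of neighboring patches is merged into a single token when the pixel variance of the group falls below a threshold, and the merged region is average-pooled back to the base patch size so that every token has the same shape. The function name `merge_neighbor_patches`, the variance criterion, and the `threshold` value are all illustrative assumptions.

```python
import numpy as np


def merge_neighbor_patches(image, patch=16, threshold=0.01):
    """Hypothetical sketch of neighbor patch merging (not the paper's method).

    Each 2x2 group of neighboring patches is merged into one token when the
    group's pixel variance is below `threshold` (an assumed redundancy test).
    Returns a list of (row, col, scale, pixels); pixels is (patch, patch, C).
    """
    h, w, c = image.shape
    block = 2 * patch
    if h % block or w % block:
        raise ValueError("image height and width must be multiples of 2 * patch")

    tokens = []
    for r in range(0, h, block):
        for col in range(0, w, block):
            region = image[r:r + block, col:col + block]
            if region.var() < threshold:
                # Redundant region: 2x2 average pooling down to one patch.
                merged = region.reshape(patch, 2, patch, 2, c).mean(axis=(1, 3))
                tokens.append((r, col, 2, merged))
            else:
                # Detailed region: keep the four original patches.
                for dr in (0, patch):
                    for dc in (0, patch):
                        sub = region[dr:dr + patch, dc:dc + patch]
                        tokens.append((r + dr, col + dc, 1, sub))
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = np.zeros((224, 224, 3))
    img[:32, :32] = rng.random((32, 32, 3))  # one textured corner
    print(len(merge_neighbor_patches(img)))
```

Under these assumptions, a completely uniform 224x224 image shrinks from 196 patch tokens to 49, which means roughly 16 times fewer pairwise attention scores per layer. The `scale` field recorded for each token is the kind of information a multi-scale position embedding would need, since merged and unmerged patches cover regions of different sizes.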
Keywords: Vision Transformer, Token Merging, Position Embeddings, Spatial Redundancy