A Novel Multimodal Hand Gesture Recognition Model Using a Combined Approach of Inter-Frame Motion and Shared Attention Weights
19 Pages · Posted: 27 Dec 2024
Abstract
Dynamic hand gesture recognition based on computer vision aims to enable computers to understand the semantic meaning conveyed by hand gestures in videos. Existing methods rely predominantly on spatiotemporal attention mechanisms to extract hand motion features over a large spatiotemporal scope. However, because frame sequences contain a substantial amount of redundant information, these methods cannot accurately focus on the moving hand region when extracting hand features. Although multimodal techniques can extract a wider variety of hand features, they are less successful at exploiting information interactions between modalities for accurate feature extraction. To address these challenges, this study proposes a multimodal hand gesture recognition model that combines inter-frame motion and shared attention weights. By jointly applying an inter-frame motion attention mechanism and adaptive down-sampling, the spatiotemporal search scope is effectively narrowed to hand-related regions, exploiting the fact that hands exhibit pronounced movement. Meanwhile, the proposed inter-modal attention weight loss allows the depth and RGB modalities to share attention weights, so that each modality can use the other's attention weights to adjust its own. Experimental results on the EgoGesture, NVGesture, and Jester datasets demonstrate that the proposed model outperforms existing state-of-the-art methods in hand motion feature extraction and hand gesture recognition accuracy.
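The two ideas in the abstract can be illustrated with a short sketch. The following PyTorch-style snippet is a minimal illustration under assumptions, not the authors' implementation: it assumes each modality branch yields a spatial attention map, derives a motion prior from inter-frame differences, and couples the RGB and depth attention maps with a mean-squared-error term. The function names `motion_attention` and `shared_attention_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def motion_attention(frames):
    """Sketch of an inter-frame motion attention prior: weight spatial
    regions by frame-to-frame change so moving hands score highest.
    frames: tensor of shape (B, T, C, H, W)."""
    # Absolute difference between consecutive frames, averaged over channels.
    diff = (frames[:, 1:] - frames[:, :-1]).abs().mean(dim=2, keepdim=True)
    b, t, _, h, w = diff.shape
    # Normalize each frame's motion map into a spatial attention distribution.
    attn = torch.softmax(diff.view(b, t, -1), dim=-1).view(b, t, 1, h, w)
    return attn  # (B, T-1, 1, H, W)

def shared_attention_loss(attn_rgb, attn_depth):
    """Sketch of an inter-modal attention weight loss: penalize
    disagreement between the RGB and depth attention maps so each
    modality can borrow the other's attention."""
    return F.mse_loss(attn_rgb, attn_depth)
```

In training, such a coupling term would plausibly be added to the usual classification objective, e.g. `loss = ce_loss + lam * shared_attention_loss(attn_rgb, attn_depth)`, where `lam` is an assumed balancing hyperparameter; the paper's actual objective may differ.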
Keywords: Hand gesture recognition, attention mechanisms, spatiotemporal scope, multimodal techniques