Multi-Level Alignment for Few-Shot Temporal Action Localization
27 Pages Posted: 29 Jan 2023
Abstract
Temporal action localization (TAL), which aims to localize actions occurring in a long untrimmed video, requires a large amount of annotated training data. In real-life applications, however, segment-level annotations are very expensive to obtain for large-scale datasets, and the space of possible action classes is far too large to annotate exhaustively. To overcome this challenge, we present a novel few-shot learning method that localizes temporal actions for previously unseen novel classes with only a few training samples. Unlike previous methods, which do not exploit the alignment of visual information at each temporal location, we propose a novel multi-level encoder cosine-similarity alignment module that implicitly learns spatiotemporal context alignment for long untrimmed videos. Towards this objective, our method adopts an episodic training scheme that learns to align similar video snippets between videos belonging to the same class from few training examples. At test time, this learned alignment of context information is adapted to novel unseen classes. Experimental results on two standard datasets, ActivityNet-1.3 and THUMOS-14, show that our method outperforms other state-of-the-art methods for few-shot temporal action localization with single and multiple action instances on ActivityNet-1.3, and achieves competitive results on THUMOS-14.
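The abstract does not give implementation details, but the core operation it names, cosine-similarity alignment between snippet-level features of two videos, can be illustrated with a minimal sketch. Everything here (function name, NumPy implementation, feature shapes) is our assumption for illustration, not the authors' code:

```python
import numpy as np

def snippet_alignment(query, support, eps=1e-8):
    """Cosine-similarity alignment between two videos' snippet features.

    query:   (Tq, D) array, one D-dim feature per snippet of the query video
    support: (Ts, D) array, one D-dim feature per snippet of a support video
    Returns a (Tq, Ts) matrix whose entry (i, j) is the cosine similarity
    between query snippet i and support snippet j; high values indicate
    snippets that likely depict the same action.
    """
    # L2-normalize each snippet feature, then take pairwise dot products.
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    s = support / (np.linalg.norm(support, axis=1, keepdims=True) + eps)
    return q @ s.T
```

In an episodic training scheme, a matrix like this would be computed between a query video and each support example of the same class, so that the model learns which temporal locations correspond; the same machinery then transfers to unseen classes at test time.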
Keywords: few-shot learning, temporal action localization, feature alignment, cosine similarity