Contrastive Topic-Enhanced Network for Video Captioning
26 Pages · Posted: 29 Mar 2023
Abstract
In the task of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most of the existing works treat transcripts only as an additional modality, ignoring their potential for capturing high-level semantics such as multi-modal topics. In fact, transcripts, as a textual attribute derived from the video, reflect the same high-level topics as the video content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network that models heterogeneous topics consistently: an alignment module is injected in advance to learn a comprehensive latent topic space and guide caption generation.
Specifically, our method includes a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities, and an instance-level contrastive task accomplishes the unannotated clip-sentence alignment. In the global topic fusion module, multi-modal topics are produced by a shared variational autoencoder, and topic-level contrastive learning makes the multi-modal topics more distinguishable. Finally, the description is generated by an end-to-end transformer architecture. Extensive experiments verify the effectiveness of our solution.
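For illustration only, the sketch below shows one plausible form of the two contrastive objectives mentioned above: an instance-level InfoNCE loss over matched clip-sentence pairs, and a shared variational autoencoder whose latent topics receive a topic-level contrastive loss. It is not the authors' released code; the feature dimensions, module names (`info_nce`, `SharedTopicVAE`), and loss weighting are assumptions.

```python
# Minimal sketch, assuming pre-extracted clip features and transcript
# sentence embeddings of a common dimension d (illustrative values).
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Instance-level contrastive loss: matched clip/sentence pairs are
    positives; all other pairs in the batch serve as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    # symmetric cross-entropy: clip->sentence and sentence->clip directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class SharedTopicVAE(nn.Module):
    """Hypothetical shared variational autoencoder mapping either modality's
    features into a common latent topic space."""

    def __init__(self, d: int = 512, n_topics: int = 64):
        super().__init__()
        self.mu = nn.Linear(d, n_topics)
        self.logvar = nn.Linear(d, n_topics)
        self.decoder = nn.Linear(n_topics, d)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(z)
        # KL divergence to a standard normal prior, averaged over the batch
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, recon, kl


if __name__ == "__main__":
    clip_feats = torch.randn(8, 512)   # placeholder visual clip features
    sent_feats = torch.randn(8, 512)   # placeholder transcript sentence features
    vae = SharedTopicVAE()
    z_v, _, kl_v = vae(clip_feats)
    z_t, _, kl_t = vae(sent_feats)
    # instance-level alignment on features, topic-level alignment on latent topics
    loss = info_nce(clip_feats, sent_feats) + info_nce(z_v, z_t) + kl_v + kl_t
    print(float(loss))
```

In this sketch the same VAE is applied to both modalities so their topics live in one latent space, which is what lets a topic-level contrastive term pull matched video/transcript topics together while pushing unrelated ones apart; the reconstruction and caption-generation losses of the full model are omitted.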
Keywords: Video Captioning, Multi-modal Topic, Contrastive Learning, Multi-modal Video Understanding