Contrastive Topic-Enhanced Network for Video Captioning
26 Pages · Posted: 29 Mar 2023
Abstract
In the task of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most of the existing works treat transcripts only as an additional modality, ignoring their potential for capturing high-level semantics such as multi-modal topics. In fact, transcripts, as a textual attribute derived from the video, reflect the same high-level topics as the video content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network that models heterogeneous topics consistently: an alignment module is injected in advance to learn a comprehensive latent topic space and guide caption generation.
Specifically, our method includes a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities, and an instance-level contrastive task accomplishes the unannotated clip-sentence alignment. In the global topic fusion module, multi-modal topics are produced by a shared variational autoencoder, and topic-level contrastive learning makes the multi-modal topics more distinguishable. Finally, the description is generated by an end-to-end transformer architecture. Extensive experiments verify the effectiveness of our solution.
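For illustration only, the sketch below shows one plausible form of the two contrastive objectives mentioned above: an instance-level InfoNCE loss over matched clip-sentence pairs, and a shared variational autoencoder whose latent topics receive a topic-level contrastive loss. It is not the authors' released code; the feature dimensions, module names (`info_nce`, `SharedTopicVAE`), and loss weighting are assumptions.

```python
# Minimal sketch, assuming pre-extracted clip features and transcript
# sentence embeddings of a common dimension d (illustrative values).
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Instance-level contrastive loss: matched clip/sentence pairs are
    positives; all other pairs in the batch serve as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    # symmetric cross-entropy: clip->sentence and sentence->clip directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class SharedTopicVAE(nn.Module):
    """Hypothetical shared variational autoencoder mapping either modality's
    features into a common latent topic space."""

    def __init__(self, d: int = 512, n_topics: int = 64):
        super().__init__()
        self.mu = nn.Linear(d, n_topics)
        self.logvar = nn.Linear(d, n_topics)
        self.decoder = nn.Linear(n_topics, d)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(z)
        # KL divergence to a standard normal prior, averaged over the batch
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, recon, kl


if __name__ == "__main__":
    clip_feats = torch.randn(8, 512)   # placeholder visual clip features
    sent_feats = torch.randn(8, 512)   # placeholder transcript sentence features
    vae = SharedTopicVAE()
    z_v, _, kl_v = vae(clip_feats)
    z_t, _, kl_t = vae(sent_feats)
    # instance-level alignment on features, topic-level alignment on latent topics
    loss = info_nce(clip_feats, sent_feats) + info_nce(z_v, z_t) + kl_v + kl_t
    print(float(loss))
```

In this sketch the same VAE is applied to both modalities so their topics live in one latent space, which is what lets a topic-level contrastive term pull matched video/transcript topics together while pushing unrelated ones apart; the reconstruction and caption-generation losses of the full model are omitted.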
Keywords: Video Captioning, Multi-modal Topic, Contrastive Learning, Multi-modal Video Understanding