Contrastive Topic-Enhanced Network for Video Captioning

26 Pages Posted: 29 Mar 2023

Yawen Zeng

Tencent Inc.

Yiru Wang

Tencent Inc.

Dongliang Liao

Tencent Inc.

Gongfu Li

Tencent Inc.

Jin Xu

South China University of Technology

Xiangmin Xu

South China University of Technology

Bo Liu

Auburn University

Hong Man

Stevens Institute of Technology

Abstract

In the task of video captioning, recent works usually focus on multi-modal video content understanding, in which transcripts extracted from speech are often adopted as an informational supplement. However, most existing works treat transcripts only as an additional modality, ignoring their potential for capturing high-level semantics, such as multi-modal topics. In fact, transcripts, as a textual attribute derived from the video, reflect the same high-level topics as the video content. Nonetheless, how to resolve the heterogeneity of multi-modal topics remains under-investigated and worth exploring. In this paper, we introduce a contrastive topic-enhanced network that models heterogeneous topics consistently; that is, it injects an alignment module in advance to learn a comprehensive latent topic space that guides caption generation.
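To make the alignment idea concrete, the following is a minimal sketch of a generic instance-level contrastive (InfoNCE) objective over paired clip and sentence embeddings. All names, shapes, and the temperature value are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch: symmetric InfoNCE for clip-sentence alignment.
# Function name, shapes, and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def clip_sentence_infonce(clip_emb: torch.Tensor,
                          sent_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, sent_emb: (batch, dim) embeddings of paired clips/sentences."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    # (batch, batch) cosine-similarity logits; the diagonal holds matched pairs.
    logits = clip_emb @ sent_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched clip-sentence pairs are positives; all other in-batch
    # pairs act as negatives, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

Under this kind of objective, no clip-sentence annotation is needed beyond the pairing itself, which is consistent with the unannotated alignment described below.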

Specifically, our method includes a local semantic alignment module and a global topic fusion module. In the local semantic alignment module, fine-grained semantic alignment at the clip-sentence granularity reduces the semantic gap between modalities, and an instance-level contrastive task accomplishes the unannotated clip-sentence alignment. In the global topic fusion module, multi-modal topics are produced by a shared variational autoencoder, and topic-level contrastive learning makes the multi-modal topics more distinguishable. Finally, a description is generated under an end-to-end transformer architecture. Extensive experiments verify the effectiveness of our solution.
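As a rough illustration of the global topic fusion module, the sketch below shows a shared variational autoencoder that maps features of either modality into one latent topic space; the layer sizes, class name, and loss weighting are assumptions for illustration, and the paper's exact design may differ.

# Hedged sketch: a shared VAE producing multi-modal topic vectors.
# Architecture details are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTopicVAE(nn.Module):
    def __init__(self, dim: int = 512, n_topics: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, n_topics)
        self.logvar = nn.Linear(256, n_topics)
        self.decoder = nn.Linear(n_topics, dim)

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample a topic vector z.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.decoder(z)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, F.mse_loss(recon, x) + kl

Because video and transcript features pass through the same encoder, their topic vectors live in one space; a topic-level contrastive loss (e.g., the InfoNCE sketch above applied to the two modalities' topic vectors of the same video) can then pull matching topics together and push mismatched ones apart.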

Keywords: Video Captioning, Multi-modal Topic, Contrastive Learning, Multi-modal Video Understanding

Suggested Citation

Zeng, Yawen and Wang, Yiru and Liao, Dongliang and Li, Gongfu and Xu, Jin and Xu, Xiangmin and Liu, Bo and Man, Hong, Contrastive Topic-Enhanced Network for Video Captioning. Available at SSRN: https://ssrn.com/abstract=4399138 or http://dx.doi.org/10.2139/ssrn.4399138
