From Visual Features to Key Concepts: A Dynamic and Static Concept-Driven Approach for Video Captioning

7 Pages Posted: 28 Jun 2024

Xin Ren

Donghua University

Yufeng Han

Donghua University

Bing Wei

Donghua University

Xue-song Tang

Donghua University

Kuangrong Hao

Donghua University

Abstract

In video, concepts are the fundamental units that represent objects and events. Video captioning requires identifying the key, relevant concepts needed to produce an accurate and concise description of a video while ignoring less important or irrelevant ones. Mainstream approaches typically employ an encoder-decoder architecture in which a visual model serves as the encoder to extract video features and a language model acts as the decoder to generate captions. Although many pre-trained visual models and large language models (LLMs) exhibit outstanding capabilities, harnessing them effectively for video captioning remains difficult: visual models extract semantically irrelevant features, which aggravate hallucination in LLMs and lead to fictitious concepts in the generated captions. To address these challenges, we introduce the Dynamic and Static Concept-driven video captioning model (DiSCo), a novel method that efficiently harnesses pre-trained models to generate accurate and coherent captions. Built on the encoder-decoder architecture, DiSCo integrates a Semantic Feature Extractor (SFE) and a Static-Dynamic Concept Detector (S-DCD), both driven by concepts, to extract semantic features and supply the large language model with pertinent prior knowledge. We freeze the parameters of both the visual model and the LLM, training only the SFE and S-DCD to filter the features from the visual model and identify the critical concepts that inform the LLM. Extensive evaluations on the MSVD and MSR-VTT datasets demonstrate that DiSCo consistently outperforms previous methods, delivering substantial performance improvements in video captioning.
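To make the training setup concrete, below is a minimal PyTorch sketch of the frozen-backbone design the abstract describes: the visual model and LLM stay frozen while only the SFE and S-DCD are trained. The internal designs of the two modules (a gated feature filter, static/dynamic scoring heads), the feature dimensions, and the concept vocabulary size are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SemanticFeatureExtractor(nn.Module):
    """Trainable filter over frozen visual features (hypothetical design):
    projects frame features and learns a per-frame relevance gate that
    downweights semantically irrelevant frames."""
    def __init__(self, vis_dim, sem_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, sem_dim)
        self.gate = nn.Sequential(nn.Linear(vis_dim, 1), nn.Sigmoid())

    def forward(self, vis_feats):                  # (B, T, vis_dim)
        g = self.gate(vis_feats)                   # (B, T, 1) relevance weights
        return self.proj(vis_feats) * g            # (B, T, sem_dim)

class StaticDynamicConceptDetector(nn.Module):
    """Scores a fixed concept vocabulary from static (appearance) and
    dynamic (motion) cues; the heads and pooling here are assumptions."""
    def __init__(self, sem_dim, num_concepts):
        super().__init__()
        self.static_head = nn.Linear(sem_dim, num_concepts)
        self.dynamic_head = nn.Linear(sem_dim, num_concepts)

    def forward(self, sem_feats):                  # (B, T, sem_dim)
        static = sem_feats.mean(dim=1)             # appearance summary
        motion = (sem_feats[:, 1:] - sem_feats[:, :-1]).abs().mean(dim=1)
        return torch.sigmoid(self.static_head(static) + self.dynamic_head(motion))

# Placeholders standing in for the frozen pre-trained visual model and LLM.
visual_model = nn.Linear(1024, 1024)
llm = nn.Linear(512, 512)
for p in list(visual_model.parameters()) + list(llm.parameters()):
    p.requires_grad = False                        # backbones stay frozen

sfe = SemanticFeatureExtractor(vis_dim=1024, sem_dim=512)
sdcd = StaticDynamicConceptDetector(sem_dim=512, num_concepts=1000)

# Only SFE and S-DCD parameters are optimized.
optimizer = torch.optim.AdamW(
    list(sfe.parameters()) + list(sdcd.parameters()), lr=1e-4)

frames = torch.randn(2, 16, 1024)                  # (batch, frames, vis_dim)
with torch.no_grad():
    vis_feats = visual_model(frames)               # frozen feature extraction
sem_feats = sfe(vis_feats)                         # (2, 16, 512)
concept_probs = sdcd(sem_feats)                    # (2, 1000) multi-label scores
top_concepts = concept_probs.topk(k=5, dim=-1)     # key concepts as priors
```

In a full pipeline, the top-k detected concepts would be passed to the frozen LLM as prior knowledge (for example, as a textual or soft prompt alongside the semantic features); the abstract leaves this interface unspecified.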

Keywords: Video captioning, Video representation, Concept detection, Knowledge transformer

Suggested Citation

Ren, Xin and Han, Yufeng and Wei, Bing and Tang, Xue-song and Hao, Kuangrong, From Visual Features to Key Concepts: A Dynamic and Static Concept-Driven Approach for Video Captioning. Available at SSRN: https://ssrn.com/abstract=4879049 or http://dx.doi.org/10.2139/ssrn.4879049

Xin Ren

Donghua University

Shanghai 200051
China

Yufeng Han

Donghua University

Shanghai 200051
China

Bing Wei

Donghua University

Shanghai 200051
China

Xue-song Tang

Donghua University

Shanghai 200051
China

Kuangrong Hao (Contact Author)

Donghua University

Shanghai 200051
China
