From Visual Features to Key Concepts: A Dynamic and Static Concept-Driven Approach for Video Captioning

7 Pages Posted: 28 Jun 2024

Xin Ren

Donghua University

Yufeng Han

Donghua University

Bing Wei

Donghua University

Xue-song Tang

Donghua University

Kuangrong Hao

Donghua University

Abstract

In video, concepts are the fundamental units that represent objects and events. Video captioning requires identifying the key, relevant concepts needed to produce an accurate and concise description of a video while ignoring less important or irrelevant ones. Mainstream approaches typically employ an encoder-decoder architecture in which a visual model serves as the encoder to extract video features and a language model acts as the decoder to generate captions. Although many pre-trained visual models and large language models (LLMs) exhibit outstanding capabilities, harnessing them effectively for video captioning remains difficult: visual models extract semantically irrelevant features, which aggravate hallucination in LLMs and lead to fictitious concepts in the generated captions. To address these challenges, we introduce the Dynamic and Static Concept-driven video captioning model (DiSCo), a novel method that efficiently harnesses pre-trained models to generate accurate and coherent captions. Built on the encoder-decoder architecture, DiSCo integrates a Semantic Feature Extractor (SFE) and a Static-Dynamic Concept Detector (S-DCD), both driven by concepts, to extract semantic features and supply the large language model with pertinent prior knowledge. We freeze the parameters of both the visual model and the LLM, training only the SFE and S-DCD to filter the features from the visual model and identify the critical concepts that inform the LLM. Extensive evaluations on the MSVD and MSR-VTT datasets demonstrate that DiSCo consistently outperforms previous methods, delivering substantial performance improvements in video captioning.
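To make the training setup concrete, below is a minimal PyTorch sketch of the frozen-backbone design the abstract describes: the visual model and LLM stay frozen while only the SFE and S-DCD are trained. The internal designs of the two modules (a gated feature filter, static/dynamic scoring heads), the feature dimensions, and the concept vocabulary size are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SemanticFeatureExtractor(nn.Module):
    """Trainable filter over frozen visual features (hypothetical design):
    projects frame features and learns a per-frame relevance gate that
    downweights semantically irrelevant frames."""
    def __init__(self, vis_dim, sem_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, sem_dim)
        self.gate = nn.Sequential(nn.Linear(vis_dim, 1), nn.Sigmoid())

    def forward(self, vis_feats):                  # (B, T, vis_dim)
        g = self.gate(vis_feats)                   # (B, T, 1) relevance weights
        return self.proj(vis_feats) * g            # (B, T, sem_dim)

class StaticDynamicConceptDetector(nn.Module):
    """Scores a fixed concept vocabulary from static (appearance) and
    dynamic (motion) cues; the heads and pooling here are assumptions."""
    def __init__(self, sem_dim, num_concepts):
        super().__init__()
        self.static_head = nn.Linear(sem_dim, num_concepts)
        self.dynamic_head = nn.Linear(sem_dim, num_concepts)

    def forward(self, sem_feats):                  # (B, T, sem_dim)
        static = sem_feats.mean(dim=1)             # appearance summary
        motion = (sem_feats[:, 1:] - sem_feats[:, :-1]).abs().mean(dim=1)
        return torch.sigmoid(self.static_head(static) + self.dynamic_head(motion))

# Placeholders standing in for the frozen pre-trained visual model and LLM.
visual_model = nn.Linear(1024, 1024)
llm = nn.Linear(512, 512)
for p in list(visual_model.parameters()) + list(llm.parameters()):
    p.requires_grad = False                        # backbones stay frozen

sfe = SemanticFeatureExtractor(vis_dim=1024, sem_dim=512)
sdcd = StaticDynamicConceptDetector(sem_dim=512, num_concepts=1000)

# Only SFE and S-DCD parameters are optimized.
optimizer = torch.optim.AdamW(
    list(sfe.parameters()) + list(sdcd.parameters()), lr=1e-4)

frames = torch.randn(2, 16, 1024)                  # (batch, frames, vis_dim)
with torch.no_grad():
    vis_feats = visual_model(frames)               # frozen feature extraction
sem_feats = sfe(vis_feats)                         # (2, 16, 512)
concept_probs = sdcd(sem_feats)                    # (2, 1000) multi-label scores
top_concepts = concept_probs.topk(k=5, dim=-1)     # key concepts as priors
```

In a full pipeline, the top-k detected concepts would be passed to the frozen LLM as prior knowledge (for example, as a textual or soft prompt alongside the semantic features); the abstract leaves this interface unspecified.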

Keywords: Video captioning, Video representation, Concept detection, Knowledge transformer

Suggested Citation

Ren, Xin and Han, Yufeng and Wei, Bing and Tang, Xue-song and Hao, Kuangrong, From Visual Features to Key Concepts: A Dynamic and Static Concept-Driven Approach for Video Captioning. Available at SSRN: https://ssrn.com/abstract=4879049 or http://dx.doi.org/10.2139/ssrn.4879049

Xin Ren

Donghua University

Shanghai 200051
China

Yufeng Han

Donghua University

Shanghai 200051
China

Bing Wei

Donghua University

Shanghai 200051
China

Xue-song Tang

Donghua University

Shanghai 200051
China

Kuangrong Hao (Contact Author)

Donghua University

Shanghai 200051
China
