MINA: Multimodal Intention Analysis of Social Media Posts via LLM-Guided Audio-Visual-Text Reasoning

Lu, Feihong; Yang, Tao; Zhu, Ziqin; Huang, Yudi; Gao, Shiqi; Luo, Yangyifei; Wang, Zengxu; Li, Qian; Sun, Qingyun; Li, Jianxin

doi:10.2139/ssrn.5962291

30 Pages. Posted: 24 Dec 2025

Qian Li

Beijing University of Posts and Telecommunications

Jianxin Li

Beihang University (BUAA) - Beijing Advanced Innovation Center for Big Data and Brain Computing

Abstract

Social media platforms have evolved into environments where users routinely express opinions and emotions through multimodal content, including text, images, videos, and audio. However, existing methods struggle to infer "what the author really wants to express". This difficulty stems from three sources: user intentions are often implicit, multimodal social media data is limited, and modalities can be inconsistent with one another in conveying user intention. To address these challenges, we propose a Multimodal social INtention Analysis framework, named MINA, which infers the underlying posting intentions of multimodal social posts. Specifically, MINA uses LLMs and MLLMs to jointly reason over textual, visual, audio, and video inputs. To distinguish the importance of each modality, MINA introduces an "intention analysis strategy generation and evaluation" module. This module employs two specialized LLMs, one for dynamic modality priority ranking and one for multidimensional evaluation, enhancing the diversity and robustness of intention analysis. Moreover, the learned intention analysis strategy guides the LLM in generating user intentions, which are then automatically screened by a filter-LLM, reducing the workload of manual annotation. By applying MINA to a public social media dataset, we construct a multimodal intention knowledge base containing 55K intentions derived from 5,500 posts with manual annotations. We use this resource to assess intention quality and benchmark widely used LLMs and MLLMs. We further evaluate on TwiBot and sarcasm detection, demonstrating substantial downstream gains from incorporating intention knowledge.
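
The control flow described in the abstract (rank modalities, evaluate the ranking, generate candidate intentions, filter them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Post`, `rank_modalities`, `score_strategy`, `analyze`, the `min_score` threshold, and the prompt wording) is a hypothetical stand-in, and each LLM is modeled as a plain callable from prompt string to reply string so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in: any callable mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class Post:
    # Assumed preprocessing: each modality is already summarized as text,
    # e.g. an image caption or an audio transcript produced by an MLLM.
    modalities: Dict[str, str]


def rank_modalities(post: Post, strategy_llm: LLM) -> List[str]:
    """Ask the strategy LLM for a priority order over the post's modalities."""
    names = list(post.modalities)
    reply = strategy_llm("Rank these modalities by relevance: " + ", ".join(names))
    # Keep only known modality names, dropping duplicates but preserving order.
    ranked = [m.strip() for m in reply.split(",") if m.strip() in names]
    ranked = list(dict.fromkeys(ranked))
    # Append any modality the LLM omitted, in the original input order.
    return ranked + [m for m in names if m not in ranked]


def score_strategy(ranking: List[str], evaluator_llm: LLM) -> float:
    """Ask the evaluator LLM for a score in [0, 1]; unparsable replies score 0."""
    reply = evaluator_llm("Score this ranking from 0 to 1: " + " > ".join(ranking))
    try:
        return min(1.0, max(0.0, float(reply)))
    except ValueError:
        return 0.0


def analyze(
    post: Post,
    strategy_llm: LLM,
    evaluator_llm: LLM,
    generator_llm: LLM,
    filter_llm: LLM,
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Rank modalities, check the ranking, generate intentions, then filter them."""
    ranking = rank_modalities(post, strategy_llm)
    if score_strategy(ranking, evaluator_llm) < min_score:
        return []  # the strategy was judged too weak to guide generation
    # Present modalities to the generator in priority order.
    context = "\n".join(f"{m}: {post.modalities[m]}" for m in ranking)
    candidates = [c.strip() for c in generator_llm(context).split("\n") if c.strip()]
    # The filter LLM is assumed to reply "keep" for intentions worth retaining.
    return [c for c in candidates if filter_llm(c).strip().lower() == "keep"]
```

In practice each callable would wrap a real model client; the sketch only shows how a ranking step, an evaluation gate, and a filter step could be chained so that fewer candidates reach manual annotation.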

Keywords: Social Media, Intention Analysis, Multimodal Understanding, Knowledge Distillation

Suggested Citation:

Lu, Feihong and Yang, Tao and Zhu, Ziqin and Huang, Yudi and Gao, Shiqi and Luo, Yangyifei and Wang, Zengxu and Li, Qian and Sun, Qingyun and Li, Jianxin, MINA: Multimodal Intention Analysis of Social Media Posts via LLM-Guided Audio-Visual-Text Reasoning. Available at SSRN: https://ssrn.com/abstract=5962291 or http://dx.doi.org/10.2139/ssrn.5962291