MINA: Multimodal Intention Analysis of Social Media Posts via LLM-Guided Audio-Visual-Text Reasoning

30 Pages Posted: 24 Dec 2025

Feihong Lu

affiliation not provided to SSRN

Tao Yang

affiliation not provided to SSRN

Ziqin Zhu

affiliation not provided to SSRN

Yudi Huang

affiliation not provided to SSRN

Shiqi Gao

affiliation not provided to SSRN

Yangyifei Luo

affiliation not provided to SSRN

Zengxu Wang

affiliation not provided to SSRN

Qian Li

Beijing University of Posts and Telecommunications

Qingyun Sun

affiliation not provided to SSRN

Jianxin Li

Beihang University (BUAA) - Beijing Advanced Innovation Center for Big Data and Brain Computing

Abstract

Social media platforms have evolved into environments where users routinely express opinions and emotions through multimodal content, including text, images, videos, and audio. However, existing methods struggle to infer "what the author really wants to express". This difficulty stems from three factors: user intentions are implicit, multimodal social media data is limited, and modalities can be inconsistent in how they convey user intention. To address these challenges, we propose a Multimodal social INtention Analysis framework, named MINA, which can accurately infer the underlying posting intentions of multimodal social posts. Specifically, MINA uses LLMs and MLLMs to jointly reason over textual, visual, audio, and video inputs. To distinguish the importance of each modality, MINA introduces an "intention analysis strategy generation and evaluation" module. This module employs two specialized LLMs, one for dynamic modality priority ranking and one for multidimensional evaluation, enhancing the diversity and robustness of intention analysis. Moreover, the learned intention analysis strategy guides the LLM in generating user intentions, which are then automatically screened by a filter-LLM, reducing the workload of manual annotation. By applying MINA to a public social media dataset, we construct a multimodal intention knowledge base containing 55K intentions derived from 5,500 posts with manual annotations. We use this resource to assess intention quality and benchmark widely used LLMs and MLLMs. We further evaluate on TwiBot and sarcasm detection, demonstrating substantial downstream gains from incorporating intention knowledge.

Keywords: Social Media, Intention Analysis, Multimodal Understanding, Knowledge Distillation

Suggested Citation

Lu, Feihong and Yang, Tao and Zhu, Ziqin and Huang, Yudi and Gao, Shiqi and Luo, Yangyifei and Wang, Zengxu and Li, Qian and Sun, Qingyun and Li, Jianxin, MINA: Multimodal Intention Analysis of Social Media Posts via LLM-Guided Audio-Visual-Text Reasoning. Available at SSRN: https://ssrn.com/abstract=5962291 or http://dx.doi.org/10.2139/ssrn.5962291

Jianxin Li (Contact Author)

Beihang University (BUAA) - Beijing Advanced Innovation Center for Big Data and Brain Computing
