MINA: Multimodal Intention Analysis of Social Media Posts via LLM-Guided Audio-Visual-Text Reasoning
30 Pages Posted: 24 Dec 2025
Abstract
Social media platforms have evolved into environments where users routinely express opinions and emotions through multimodal content, including text, images, videos, and audio. However, existing methods struggle to infer "what the author really wants to express". This difficulty stems from three factors: user intentions are often implicit, multimodal social media data is limited, and modalities can be inconsistent with one another in conveying user intention. To address these challenges, we propose a Multimodal social INtention Analysis framework, named MINA, which infers the underlying posting intentions of multimodal social posts. Specifically, MINA uses LLMs and MLLMs to jointly reason over textual, visual, audio, and video inputs. To distinguish the importance of each modality, MINA introduces an "intention analysis strategy generation and evaluation" module. This module employs two specialized LLMs, one for dynamic modality priority ranking and one for multidimensional evaluation, enhancing the diversity and robustness of intention analysis. Moreover, the learned intention analysis strategy guides the LLM in generating user intentions, which are then automatically screened by a filter-LLM, reducing the workload of manual annotation. By applying MINA to a public social media dataset, we construct a multimodal intention knowledge base containing 55K intentions derived from 5,500 posts with manual annotations. We use this resource to assess intention quality and to benchmark widely used LLMs and MLLMs. We further evaluate on TwiBot and on sarcasm detection, demonstrating substantial downstream gains from incorporating intention knowledge.
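The control flow described in the abstract (rank modalities, evaluate candidate strategies, generate intentions under the best strategy, then filter them) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (Strategy, run_pipeline, the *_stub functions, the sample post) is hypothetical, and the four LLM roles are replaced by simple deterministic Python functions so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical modality names; the abstract lists text, images, audio, and video.
MODALITIES = ["text", "image", "audio", "video"]
Post = Dict[str, Optional[str]]


@dataclass
class Strategy:
    priority: List[str]  # modalities, most important first
    score: float = 0.0   # filled in by the evaluator


def run_pipeline(
    post: Post,
    propose: Callable[[Post, int], Strategy],      # stands in for the ranking LLM
    evaluate: Callable[[Post, Strategy], float],   # stands in for the evaluation LLM
    generate: Callable[[Post, Strategy], List[str]],  # stands in for the generating LLM
    keep: Callable[[Post, str], bool],             # stands in for the filter-LLM
    n_candidates: int = 3,
) -> Tuple[Strategy, List[str]]:
    # 1. Propose several candidate modality-priority strategies.
    candidates = [propose(post, i) for i in range(n_candidates)]
    # 2. Score each candidate and keep the best one.
    for strategy in candidates:
        strategy.score = evaluate(post, strategy)
    best = max(candidates, key=lambda s: s.score)
    # 3. Generate intentions under the best strategy, then filter them.
    intentions = generate(post, best)
    return best, [i for i in intentions if keep(post, i)]


# --- Toy stand-ins (assumptions, not real model calls) ---
def propose_stub(post: Post, i: int) -> Strategy:
    available = [m for m in MODALITIES if post.get(m) is not None]
    k = i % len(available)
    return Strategy(priority=available[k:] + available[:k])


def evaluate_stub(post: Post, strategy: Strategy) -> float:
    # Toy rule: prefer strategies that put text first.
    return 1.0 if strategy.priority[0] == "text" else 0.5


def generate_stub(post: Post, strategy: Strategy) -> List[str]:
    # Emits one placeholder intention per modality plus an empty one to filter out.
    return [f"intent from {m}" for m in strategy.priority] + [""]


def keep_stub(post: Post, intention: str) -> bool:
    return bool(intention.strip())


post = {"text": "Great, another Monday...", "image": "photo of a traffic jam",
        "audio": None, "video": None}
best, kept = run_pipeline(post, propose_stub, evaluate_stub, generate_stub, keep_stub)
print(best.priority, kept)
```

In a real system each stub would be replaced by a call to the corresponding model; the sketch only fixes the order of the stages and the data passed between them.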
Keywords: Social Media, Intention Analysis, Multimodal Understanding, Knowledge Distillation