Modeling Speaker-Specific Long-Term Context for Emotion Recognition in Conversation
Abstract
Emotion recognition in conversation (ERC) is essential for enabling empathetic responses and fostering harmonious human-computer interaction. Modeling speaker-specific temporal dependencies helps capture speaker-sensitive emotional representations, thereby improving the understanding of emotional dynamics among speakers within a conversation. However, prior research has focused primarily on the information available at speaking moments, neglecting contextual cues during silent moments and thus producing incomplete, discontinuous representations of each speaker's emotional context. This study addresses these limitations by proposing a novel framework, the Speaker-specific Long-term Context Encoding Network (SLCNet), for the ERC task. SLCNet is designed to capture each speaker's complete long-term context, covering both speaking and non-speaking moments. Specifically, an attention-based multimodal fusion network first dynamically focuses on key modalities for effective multimodal fusion. Two graph neural networks then perform feature completion by leveraging intra-speaker temporal context and inter-speaker interaction influence, respectively. Finally, a shared LSTM models the temporally complete, speaker-sensitive context of each speaker. SLCNet is jointly optimized over all speakers and trained end to end. Extensive experiments on benchmark datasets demonstrate the superior performance of SLCNet and its ability to complete emotional representations at silent moments, highlighting its potential to advance ERC research.
Keywords: Emotion recognition in conversation, Speaker-specific context encoding, Graph neural networks, Affective empathy
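The three-stage pipeline described in the abstract can be pictured with a short code sketch. The PyTorch snippet below is a minimal illustration only: the class names (AttentionFusion, GraphLayer, SLCNetSketch), the dense per-speaker adjacency tensors, and the mean-aggregation graph convolution are assumptions made for this sketch, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    # Score each modality per utterance and take a weighted sum.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, modalities):                    # list of (B, T, D) tensors
        stacked = torch.stack(modalities, dim=2)      # (B, T, M, D)
        weights = F.softmax(self.score(stacked), dim=2)
        return (weights * stacked).sum(dim=2)         # (B, T, D)

class GraphLayer(nn.Module):
    # One dense graph step: mean-aggregate neighbours, then project.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):                        # x: (B, T, D), adj: (B, T, T)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return F.relu(self.proj(adj @ x / deg))

class SLCNetSketch(nn.Module):
    def __init__(self, dim, num_classes, num_speakers):
        super().__init__()
        self.num_speakers = num_speakers
        self.fusion = AttentionFusion(dim)
        self.intra_gnn = GraphLayer(dim)              # intra-speaker temporal edges
        self.inter_gnn = GraphLayer(dim)              # inter-speaker interaction edges
        self.lstm = nn.LSTM(dim, dim, batch_first=True)  # shared across speakers
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, modalities, intra_adj, inter_adj, speaker_mask):
        # intra_adj, inter_adj: (B, S, T, T) per-speaker graphs (assumed inputs);
        # speaker_mask: (B, T, S), 1 where utterance t belongs to speaker s.
        h = self.fusion(modalities)                   # 1) attention-based fusion
        tracks = []
        for s in range(self.num_speakers):
            # 2) complete speaker s's timeline: propagate along the speaker's
            #    own utterances and from interlocutors' utterances
            h_s = h + self.intra_gnn(h, intra_adj[:, s]) \
                    + self.inter_gnn(h, inter_adj[:, s])
            # 3) shared LSTM over the temporally complete sequence
            out, _ = self.lstm(h_s)
            tracks.append(out)
        tracks = torch.stack(tracks, dim=2)           # (B, T, S, D)
        # read each utterance's state from its own speaker's track
        h = (tracks * speaker_mask.unsqueeze(-1)).sum(dim=2)
        return self.classifier(h)                     # per-utterance emotion logits

In this reading of the abstract, feature completion is performed per speaker: each speaker's track is propagated over their own utterances (intra-speaker graph) and over interlocutors' utterances (inter-speaker graph) before the shared LSTM reads the resulting temporally complete sequence.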