EMoG: Synthesizing Emotive Co-Speech 3D Gesture with Diffusion Model

26 Pages. Posted: 7 May 2024

Lianying Yin, Hunan University
Yijun Wang, Hunan University
Tianyu He, Microsoft Research Asia
Wei Zhao, Hunan University
Xin Jin, Eastern Institute of Technology
Jianxin Lin, Hunan University

Abstract

Co-speech 3D human gesture synthesis, which aims to generate realistic 3D animations of human motion from audio input, has received extensive attention. However, generating highly vivid motion remains challenging due to the one-to-many mapping between speech content and gestures. This intricate association often causes certain gesture properties to be overlooked, notably the subtle differences in gesture emotion. Moreover, audio-driven 3D human gesture synthesis is further complicated by the interplay between joint correlations and the temporal variation intrinsic to human gesture, which often leads to plain, inexpressive generation. In this paper, we present a novel framework, dubbed EMotive Gesture generation (EMoG), to tackle the above challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotional representations extracted from audio into the distribution modeling process of the diffusion model, explicitly increasing generation diversity; 2) To enable emotive gesture generation, we decompose the difficult gesture generation task into two sub-problems, joint correlation modeling and temporal dynamics modeling, which are then explicitly tackled by our proposed Temporal-dynamics and Joint-correlation aware Transformer (TJFormer). Through extensive evaluations, we demonstrate that our method surpasses previous state-of-the-art approaches, offering substantial advantages in gesture synthesis. Furthermore, our approach enables emotion transfer and affords fine-grained style control over specific body parts. This advancement paves the way for expressive gesture generation with heightened fidelity, opening up new possibilities for immersive experiences in 3D human gesture synthesis.
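
To make the framework concrete, the following PyTorch sketch illustrates the two ideas the abstract describes: a denoising network conditioned on audio features, an emotion representation, and the diffusion timestep, and a transformer block whose attention is factorized into a joint-correlation pass and a temporal-dynamics pass. This is only an illustrative sketch, not the authors' implementation; the module names (TJBlock, EmotiveDenoiser) and all dimensions (64-d audio features, 8 emotion classes, 55 joints, 1000 diffusion steps) are assumptions made for the example.

import torch
import torch.nn as nn


class TJBlock(nn.Module):
    """One factorized block: self-attention across joints, then across frames."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_j = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        # Joint-correlation modeling: attend across joints within each frame.
        h = x.reshape(b * t, j, d)
        q = self.norm_j(h)
        h = h + self.joint_attn(q, q, q, need_weights=False)[0]
        # Temporal-dynamics modeling: attend across frames for each joint.
        h = h.reshape(b, t, j, d).permute(0, 2, 1, 3).reshape(b * j, t, d)
        q = self.norm_t(h)
        h = h + self.time_attn(q, q, q, need_weights=False)[0]
        # Position-wise feed-forward refinement.
        h = h + self.ff(self.norm_f(h))
        return h.reshape(b, j, t, d).permute(0, 2, 1, 3)


class EmotiveDenoiser(nn.Module):
    """Predicts the clean gesture from a noised gesture, conditioned on per-frame
    audio features, an emotion label, and the diffusion timestep."""

    def __init__(self, joint_dim=3, dim=128, audio_dim=64, num_emotions=8,
                 num_steps=1000, num_blocks=4):
        super().__init__()
        self.in_proj = nn.Linear(joint_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.emotion_emb = nn.Embedding(num_emotions, dim)
        self.step_emb = nn.Embedding(num_steps, dim)
        self.blocks = nn.ModuleList([TJBlock(dim) for _ in range(num_blocks)])
        self.out_proj = nn.Linear(dim, joint_dim)

    def forward(self, noisy_gesture, audio_feat, emotion_id, step):
        # noisy_gesture: (B, T, J, 3); audio_feat: (B, T, audio_dim)
        # emotion_id: (B,) integer labels; step: (B,) diffusion step indices
        h = self.in_proj(noisy_gesture)
        cond = (self.audio_proj(audio_feat)              # per-frame audio condition
                + self.emotion_emb(emotion_id)[:, None]  # per-sequence emotion condition
                + self.step_emb(step)[:, None])          # diffusion timestep condition
        h = h + cond[:, :, None, :]                      # broadcast condition over joints
        for blk in self.blocks:
            h = blk(h)
        return self.out_proj(h)                          # predicted clean gesture


# Toy usage: 2 sequences, 30 frames, 55 joints.
model = EmotiveDenoiser()
x_t = torch.randn(2, 30, 55, 3)
audio = torch.randn(2, 30, 64)
emotion = torch.tensor([1, 3])
step = torch.randint(0, 1000, (2,))
print(model(x_t, audio, emotion, step).shape)  # torch.Size([2, 30, 55, 3])

In this factorized design, attending across joints within each frame models joint correlations, while attending across frames for each joint models temporal dynamics; the emotion embedding is injected as an additive condition on the denoiser, mirroring how the abstract describes feeding emotional representations into the diffusion model's distribution modeling.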

Keywords: Co-speech gesture generation, Diffusion model

Suggested Citation

Yin, Lianying and Wang, Yijun and He, Tianyu and Zhao, Wei and Jin, Xin and Lin, Jianxin, EMoG: Synthesizing Emotive Co-Speech 3D Gesture with Diffusion Model. Available at SSRN: https://ssrn.com/abstract=4818829 or http://dx.doi.org/10.2139/ssrn.4818829

Lianying Yin

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Yijun Wang

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Tianyu He

Microsoft Research Asia (email)

Beijing
China

Wei Zhao

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Xin Jin

Eastern Institute of Technology (email)

501 Gloucester Street
Taradale
Napier, 4142
New Zealand

Jianxin Lin (Contact Author)

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China
