EMoG: Synthesizing Emotive Co-Speech 3D Gesture with Diffusion Model
26 pages. Posted: 7 May 2024
Abstract
Co-speech 3D human gesture synthesis, which aims to generate realistic 3D animations of human motion from audio input, has received extensive attention. However, generating highly vivid motion remains challenging due to the one-to-many mapping between speech content and gestures. This intricate association often causes certain gesture properties to be overlooked, notably subtle differences in gesture emotion. Moreover, audio-driven 3D human gesture synthesis is further complicated by the interplay between joint correlation and the temporal variation intrinsic to human gesture, which often yields bland, generic motion. In this paper, we present a novel framework, dubbed EMotive Gesture generation (EMoG), to tackle these challenges with denoising diffusion models: 1) to alleviate the one-to-many problem, we incorporate emotional representations extracted from the audio into the distribution modeling of the diffusion model, explicitly increasing generation diversity; 2) to enable emotive gesture generation, we decompose the difficult generation task into two sub-problems, joint correlation modeling and temporal dynamics modeling, which are explicitly tackled by our proposed Temporal-dynamics and Joint-correlation aware Transformer (TJFormer). Through extensive evaluations, we demonstrate that our method surpasses previous state-of-the-art approaches, offering substantial advantages in gesture synthesis. Furthermore, our approach also enables emotion transfer and fine-grained style control over specific body parts. This advancement paves the way for expressive gesture generation with heightened fidelity, opening up new possibilities for immersive experiences in 3D human gesture synthesis.
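The decomposition described above can be illustrated with a minimal sketch: one attention pass mixes information across joints within each frame (joint correlation), a second mixes information across frames for each joint (temporal dynamics), with the emotion representation injected as an additive condition. This is an illustrative toy in NumPy under assumed tensor shapes, not the paper's TJFormer implementation; all learned projections are omitted and every function name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def factorized_gesture_block(x, emotion):
    """x: (frames, joints, dim) gesture features; emotion: (dim,) embedding.

    Hypothetical sketch of splitting gesture modeling into
    joint-correlation and temporal-dynamics attention passes,
    with an additive emotion condition (no learned weights).
    """
    x = x + emotion                    # broadcast emotion over frames and joints
    # 1) Joint correlation: joints attend to each other within a frame.
    x = x + attention(x, x, x)         # shape stays (T, J, D)
    # 2) Temporal dynamics: each joint attends across frames.
    xt = x.swapaxes(0, 1)              # (J, T, D)
    xt = xt + attention(xt, xt, xt)
    return xt.swapaxes(0, 1)           # back to (T, J, D)

T, J, D = 8, 24, 16                    # frames, joints, feature dim (illustrative)
rng = np.random.default_rng(0)
x = rng.standard_normal((T, J, D))
emo = rng.standard_normal(D)
y = factorized_gesture_block(x, emo)
print(y.shape)  # (8, 24, 16)
```

In a diffusion setting, a block like this would sit inside the denoiser, predicting the noise (or clean pose) at each step while the emotion embedding steers the sampled gesture distribution.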
Keywords: Co-speech gesture generation, Diffusion model
Suggested Citation: