EMoG: Synthesizing Emotive Co-Speech 3D Gesture with Diffusion Model

26 Pages. Posted: 7 May 2024

Lianying Yin, Hunan University
Yijun Wang, Hunan University
Tianyu He, Microsoft Research Asia
Wei Zhao, Hunan University
Xin Jin, Eastern Institute of Technology
Jianxin Lin, Hunan University

Abstract

Co-speech 3D human gesture synthesis, which aims to generate realistic 3D animations of human motion from audio input, has received extensive attention. However, generating highly vivid motion remains challenging due to the one-to-many mapping between speech content and gestures. This intricate association often causes certain gesture properties to be overlooked, notably the subtle differences in gesture emotion. Moreover, audio-driven 3D human gesture synthesis is further complicated by the interplay between joint correlations and the temporal variation intrinsic to human gesture, which often leads to plain, inexpressive generation. In this paper, we present a novel framework, dubbed EMotive Gesture generation (EMoG), to tackle the above challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotional representations extracted from audio into the distribution modeling process of the diffusion model, explicitly increasing generation diversity; 2) To enable emotive gesture generation, we decompose the difficult gesture generation task into two sub-problems, joint correlation modeling and temporal dynamics modeling, which are then explicitly tackled by our proposed Temporal-dynamics and Joint-correlation aware Transformer (TJFormer). Through extensive evaluations, we demonstrate that our method surpasses previous state-of-the-art approaches, offering substantial advantages in gesture synthesis. Furthermore, our approach enables emotion transfer and affords fine-grained style control over specific body parts. This advancement paves the way for expressive gesture generation with heightened fidelity, opening up new possibilities for immersive experiences in 3D human gesture synthesis.
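
To make the framework concrete, the following PyTorch sketch illustrates the two ideas the abstract describes: a denoising network conditioned on audio features, an emotion representation, and the diffusion timestep, and a transformer block whose attention is factorized into a joint-correlation pass and a temporal-dynamics pass. This is only an illustrative sketch, not the authors' implementation; the module names (TJBlock, EmotiveDenoiser) and all dimensions (64-d audio features, 8 emotion classes, 55 joints, 1000 diffusion steps) are assumptions made for the example.

import torch
import torch.nn as nn


class TJBlock(nn.Module):
    """One factorized block: self-attention across joints, then across frames."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_j = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        # Joint-correlation modeling: attend across joints within each frame.
        h = x.reshape(b * t, j, d)
        q = self.norm_j(h)
        h = h + self.joint_attn(q, q, q, need_weights=False)[0]
        # Temporal-dynamics modeling: attend across frames for each joint.
        h = h.reshape(b, t, j, d).permute(0, 2, 1, 3).reshape(b * j, t, d)
        q = self.norm_t(h)
        h = h + self.time_attn(q, q, q, need_weights=False)[0]
        # Position-wise feed-forward refinement.
        h = h + self.ff(self.norm_f(h))
        return h.reshape(b, j, t, d).permute(0, 2, 1, 3)


class EmotiveDenoiser(nn.Module):
    """Predicts the clean gesture from a noised gesture, conditioned on per-frame
    audio features, an emotion label, and the diffusion timestep."""

    def __init__(self, joint_dim=3, dim=128, audio_dim=64, num_emotions=8,
                 num_steps=1000, num_blocks=4):
        super().__init__()
        self.in_proj = nn.Linear(joint_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.emotion_emb = nn.Embedding(num_emotions, dim)
        self.step_emb = nn.Embedding(num_steps, dim)
        self.blocks = nn.ModuleList([TJBlock(dim) for _ in range(num_blocks)])
        self.out_proj = nn.Linear(dim, joint_dim)

    def forward(self, noisy_gesture, audio_feat, emotion_id, step):
        # noisy_gesture: (B, T, J, 3); audio_feat: (B, T, audio_dim)
        # emotion_id: (B,) integer labels; step: (B,) diffusion step indices
        h = self.in_proj(noisy_gesture)
        cond = (self.audio_proj(audio_feat)              # per-frame audio condition
                + self.emotion_emb(emotion_id)[:, None]  # per-sequence emotion condition
                + self.step_emb(step)[:, None])          # diffusion timestep condition
        h = h + cond[:, :, None, :]                      # broadcast condition over joints
        for blk in self.blocks:
            h = blk(h)
        return self.out_proj(h)                          # predicted clean gesture


# Toy usage: 2 sequences, 30 frames, 55 joints.
model = EmotiveDenoiser()
x_t = torch.randn(2, 30, 55, 3)
audio = torch.randn(2, 30, 64)
emotion = torch.tensor([1, 3])
step = torch.randint(0, 1000, (2,))
print(model(x_t, audio, emotion, step).shape)  # torch.Size([2, 30, 55, 3])

In this factorized design, attending across joints within each frame models joint correlations, while attending across frames for each joint models temporal dynamics; the emotion embedding is injected as an additive condition on the denoiser, mirroring how the abstract describes feeding emotional representations into the diffusion model's distribution modeling.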

Keywords: Co-speech gesture generation, Diffusion model

Suggested Citation

Yin, Lianying and Wang, Yijun and He, Tianyu and Zhao, Wei and Jin, Xin and Lin, Jianxin, EMoG: Synthesizing Emotive Co-Speech 3D Gesture with Diffusion Model. Available at SSRN: https://ssrn.com/abstract=4818829 or http://dx.doi.org/10.2139/ssrn.4818829

Lianying Yin

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Yijun Wang

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Tianyu He

Microsoft Research Asia (email)

Beijing
China

Wei Zhao

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China

Xin Jin

Eastern Institute of Technology (email)

501 Gloucester Street
Taradale
Napier, 4142
New Zealand

Jianxin Lin (Contact Author)

Hunan University (email)

2 Lushan South Rd
Changsha, Hunan 410082
China
