DIA-TTS: Deep-Inherited Attention Based Text-to-Speech Synthesizer

15 Pages Posted: 25 Oct 2022 Publication Status: Under Review

Junxiao Yu

Nanjing Medical University

Zhengyuan Xu

Nanjing Medical University

Xu He

Nanjing Medical University

Jian Wang

Nanjing Medical University

Bin Liu

Nanjing Medical University

Rui Feng

Nanjing Medical University

Songsheng Zhu

Nanjing Medical University

Wei Wang

Nanjing Medical University

Jianqing Li

Nanjing Medical University

Abstract

The text-to-speech (TTS) synthesizer has been widely used as a vital assistive tool in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder-decoder alignment, whose biggest shortcoming is that words are incorrectly or repeatedly generated when dealing with long sentences. Such models may also produce run-on sentences or wrong breaks regardless of punctuation marks, making the synthesized waveform sound unnatural and lack emotion. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism along with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA by sharing the same training parameters, which tightens the token-frame correlation and speeds up the alignment process. In addition, the LSF is adopted to enhance the context connection by expanding the DIA concentration region. A multi-RNN block is also used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and the multi-RNN layers contributes to superior performance in predicting phrase breaks of the synthesized speech while maintaining high quality. We used WaveGlow as a vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further demonstrate the superiority of the DIA mechanism for phrase breaking and attention robustness enhancement.
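To make the two ideas in the abstract concrete, the sketch below shows one attention step that (a) restricts scoring to a local window around the previous alignment peak, whose half-width plays the role of an adjustable local-sensitive factor, and (b) is applied repeatedly with the same (shared) parameters, mimicking the inheritance of the DIA across iterations. All function names, the dot-product scoring, and the hard-window scheme are illustrative assumptions for exposition only, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dia_step(query, keys, prev_align, lsf=3):
    """One hypothetical deep-inherited attention step.

    query: (d,) decoder hidden state; keys: (T, d) encoder outputs;
    prev_align: (T,) alignment from the previous iteration;
    lsf: half-width of the local-sensitive window, i.e. how far the
    attention may move from the previous peak (an assumed reading of
    the paper's adjustable local-sensitive factor).
    """
    T = keys.shape[0]
    # Content-based energies (plain dot-product score for simplicity).
    energies = keys @ query
    # Local-sensitive masking: only tokens within +/- lsf frames of the
    # previous attention peak stay eligible; a larger lsf widens the
    # concentration region and admits more context.
    peak = int(prev_align.argmax())
    mask = np.full(T, -np.inf)
    lo, hi = max(0, peak - lsf), min(T, peak + lsf + 1)
    mask[lo:hi] = 0.0
    return softmax(energies + mask)

# "Inherited" iterations: the same step (same parameters) is applied
# repeatedly, each pass refining the previous alignment.
rng = np.random.default_rng(0)
keys = rng.normal(size=(20, 8))
query = rng.normal(size=8)
align = np.zeros(20)
align[0] = 1.0
for _ in range(3):
    align = dia_step(query, keys, align, lsf=2)
```

The resulting alignment is a valid distribution whose support never exceeds the 2·lsf + 1 window, which is the property the LSF is meant to control.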

Keywords: Natural Language Processing, Text to Speech, Deep Learning, Deep Neural Network, Local Sensitive Attention

Suggested Citation

Yu, Junxiao and Xu, Zhengyuan and He, Xu and Wang, Jian and Liu, Bin and Feng, Rui and Zhu, Songsheng and Wang, Wei and Li, Jianqing, DIA-TTS: Deep-Inherited Attention Based Text-to-Speech Synthesizer. Available at SSRN: https://ssrn.com/abstract=4257520 or http://dx.doi.org/10.2139/ssrn.4257520

Junxiao Yu

Nanjing Medical University (email)

300 Guangzhou Road
Nanjing, 210029
China

Jianqing Li (Contact Author)

Nanjing Medical University (email)

300 Guangzhou Road
Nanjing, 210029
China
