DIA-TTS: Deep-Inherited Attention Based Text-to-Speech Synthesizer
15 Pages · Posted: 25 Oct 2022 · Publication Status: Under Review
Abstract
Text-to-speech (TTS) synthesizers are widely used as vital assistive tools in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder-decoder alignment, whose biggest shortcoming is that words are generated incorrectly or repeatedly when dealing with long sentences. Such models may also produce run-on sentences or wrong breaks regardless of punctuation marks, which makes the synthesized waveform sound unnatural and lack emotion. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism together with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA to share the same training parameters, which tightens the token-frame correlation and accelerates the alignment process. The LSF is adopted to enhance context connection by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and the multi-RNN layers contributes to strong performance in predicting phrase breaks in high-quality synthesized speech. We use WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieves a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further demonstrate the superiority of the DIA mechanism for phrase breaking and attention robustness enhancement.
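As a rough illustration of the mechanism sketched in the abstract, the PyTorch snippet below shows one plausible reading of the DIA: a location-sensitive attention step is applied for several iterations that reuse (inherit) the same weights, and a local-sensitive factor softens the attention energies so probability mass spreads over a wider concentration region. The paper's code is not published, so names such as n_iterations (the inheritance depth) and lsf, as well as the exact placement of the LSF (here, a softmax temperature), are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming the DIA is an iterated location-sensitive
# attention with shared weights; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepInheritedAttention(nn.Module):
    """Location-sensitive attention run for several weight-sharing
    ("inherited") iterations, with a local-sensitive factor (LSF) that
    widens the region the attention can concentrate on."""

    def __init__(self, query_dim, memory_dim, attn_dim,
                 n_filters=32, kernel_size=31, n_iterations=3, lsf=2.0):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Convolving the previous alignment supplies location features.
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2,
                                       bias=False)
        self.location_dense = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.n_iterations = n_iterations  # same weights reused each pass
        self.lsf = lsf                    # hypothetical local-sensitive factor

    def forward(self, query, memory, prev_alignment):
        # query: (B, query_dim) decoder hidden state
        # memory: (B, T, memory_dim) encoder outputs
        # prev_alignment: (B, T) alignment from the previous decoder step
        keys = self.memory_layer(memory)                       # (B, T, attn_dim)
        alignment = prev_alignment
        for _ in range(self.n_iterations):                     # inherited passes
            loc = self.location_conv(alignment.unsqueeze(1))   # (B, F, T)
            loc = self.location_dense(loc.transpose(1, 2))     # (B, T, attn_dim)
            energies = self.v(torch.tanh(
                self.query_layer(query).unsqueeze(1) + keys + loc)).squeeze(-1)
            # LSF as a temperature: dividing by lsf > 1 flattens the
            # distribution, expanding the attended neighborhood.
            alignment = F.softmax(energies / self.lsf, dim=-1)
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)
        return context, alignment
```

Under this reading, each inherited pass refines the alignment produced by the previous one without adding parameters, which is consistent with the abstract's claim that iteration tightens the token-frame correlation while keeping training cost low.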
Keywords: Natural Language Processing, Text to Speech, Deep Learning, Deep Neural Network, Local Sensitive Attention