DIA-TTS: Deep-Inherited Attention Based Text-to-Speech Synthesizer
15 Pages · Posted: 25 Oct 2022 · Publication Status: Under Review
Abstract
Text-to-speech (TTS) synthesizers are widely used as vital assistive tools in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder-decoder alignment, whose biggest shortcoming is that words are generated incorrectly or repeatedly when dealing with long sentences. Such models may also produce run-on sentences or wrong breaks regardless of punctuation marks, which makes the synthesized waveform sound unnatural and lack emotion. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism together with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA to share the same training parameters, which tightens the token-frame correlation and accelerates the alignment process. The LSF is adopted to enhance context connection by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation, and hidden-state information derived from the multi-RNN layers is utilized for attention alignment. The collaborative work of the DIA and the multi-RNN layers contributes to strong performance in predicting phrase breaks in high-quality synthesized speech. We use WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieves a mean opinion score (MOS) of 4.48 for naturalness. Ablation studies further demonstrate the superiority of the DIA mechanism for phrase breaking and attention robustness enhancement.
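As a rough illustration of the mechanism sketched in the abstract, the PyTorch snippet below shows one plausible reading of the DIA: a location-sensitive attention step is applied for several iterations that reuse (inherit) the same weights, and a local-sensitive factor softens the attention energies so probability mass spreads over a wider concentration region. The paper's code is not published, so names such as n_iterations (the inheritance depth) and lsf, as well as the exact placement of the LSF (here, a softmax temperature), are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming the DIA is an iterated location-sensitive
# attention with shared weights; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepInheritedAttention(nn.Module):
    """Location-sensitive attention run for several weight-sharing
    ("inherited") iterations, with a local-sensitive factor (LSF) that
    widens the region the attention can concentrate on."""

    def __init__(self, query_dim, memory_dim, attn_dim,
                 n_filters=32, kernel_size=31, n_iterations=3, lsf=2.0):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        # Convolving the previous alignment supplies location features.
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2,
                                       bias=False)
        self.location_dense = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.n_iterations = n_iterations  # same weights reused each pass
        self.lsf = lsf                    # hypothetical local-sensitive factor

    def forward(self, query, memory, prev_alignment):
        # query: (B, query_dim) decoder hidden state
        # memory: (B, T, memory_dim) encoder outputs
        # prev_alignment: (B, T) alignment from the previous decoder step
        keys = self.memory_layer(memory)                       # (B, T, attn_dim)
        alignment = prev_alignment
        for _ in range(self.n_iterations):                     # inherited passes
            loc = self.location_conv(alignment.unsqueeze(1))   # (B, F, T)
            loc = self.location_dense(loc.transpose(1, 2))     # (B, T, attn_dim)
            energies = self.v(torch.tanh(
                self.query_layer(query).unsqueeze(1) + keys + loc)).squeeze(-1)
            # LSF as a temperature: dividing by lsf > 1 flattens the
            # distribution, expanding the attended neighborhood.
            alignment = F.softmax(energies / self.lsf, dim=-1)
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)
        return context, alignment
```

Under this reading, each inherited pass refines the alignment produced by the previous one without adding parameters, which is consistent with the abstract's claim that iteration tightens the token-frame correlation while keeping training cost low.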
Keywords: Natural Language Processing, Text to Speech, Deep Learning, Deep Neural Network, Local Sensitive Attention