Ensembles of multi-scale VGG acoustic models
Michael Heck, Masayuki Suzuki, et al.
INTERSPEECH 2017
The proper segmentation of an input text string into meaningful intonational phrase units is a fundamental task in the text-processing component of a text-to-speech (TTS) system that generates intelligible and natural synthesis. In this work we examine how to build a symbolic phrase-assignment model within the front end (FE) of a North American English TTS system when high-quality labels for supervised learning are unavailable and/or potentially mismatched to the target corpus and domain. We explore a labeling scheme that merges heuristics derived from (i) automatic high-quality phonetic alignments, (ii) linguistic rules, and (iii) a legacy acoustic phrase-labeling system to arrive at a ground truth that can be used to train a bidirectional recurrent neural network model. We evaluate this model in two ways: with objective metrics describing categorical phrase assignment within the FE proper, and by the effect that these intermediate labels have on the TTS back end in the task of continuous prosody prediction (i.e., intonation and duration contours, and pausing). For this second task, we rely on subjective listening tests and demonstrate that the proposed system significantly outperforms a linguistic rule-based baseline for two different synthetic voices.