Unfolded recurrent neural networks for speech recognition
George Saon, Hagen Soltau, et al.
INTERSPEECH 2014
Deep Neural Networks (DNNs) have been shown to provide state-of-the-art performance over other baseline models in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- and long-term contextual factors that a static model depending on a fixed-size context window can fail to properly capture. In this work, we look at recurrent neural networks (RNNs), which are deep in time and can store state information from an arbitrarily long input history when making a prediction. We show that RNNs provide improved performance over DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.
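A minimal NumPy sketch of the recurrent formulation the abstract describes: an Elman-style RNN unrolled over time, whose hidden state carries information from the entire input history rather than from a fixed-size context window as in a static DNN. The layer sizes, weight names (W_xh, W_hh, W_hy), and the single F0-like output stream are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Elman-style RNN, unrolled over the input sequence. Unlike a DNN with a
# fixed-size context window, the hidden state h_t summarizes the whole
# input history x_1..x_t. All sizes and names below are illustrative.

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 16, 32, 1  # e.g. text features in, one F0-like target out

W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def forward(xs):
    """Run the unrolled RNN over a sequence xs of shape (T, n_in)."""
    h = np.zeros(n_hid)
    ys = []
    for x in xs:                          # one step per time frame
        h = np.tanh(W_xh @ x + W_hh @ h)  # state accumulates arbitrary history
        ys.append(W_hy @ h)               # per-frame prosodic prediction
    return np.stack(ys)

xs = rng.normal(size=(10, n_in))  # a toy 10-frame input sequence
print(forward(xs).shape)          # (10, 1): one prediction per frame
```

Because the same weights are applied at every step, unrolling the recurrence over T frames yields a network that is T layers deep in time while keeping the parameter count of a single layer.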