Low-Resource Speech Recognition of 500-Word Vocabularies
Sabine Deligne, Ellen Eide, et al.
INTERSPEECH - Eurospeech 2001
Conventional methods for training statistical models for automatic speech recognition, such as acoustic and language models, have focused on criteria such as maximum likelihood and sentence or word error rate (WER). However, unlike dictation systems, the goal for spoken dialogue systems is to understand the meaning of what a person says, not to transcribe every word correctly. For such systems, we propose to optimize the statistical models under end-to-end system performance criteria. We illustrate this principle by focusing on the estimation of the language model (LM) component of a natural language call routing system. This estimation, carried out under a conditional maximum likelihood objective, aims at optimizing the call routing (classification) accuracy, which is often the criterion of interest in these systems. LM updates are derived using the extended Baum-Welch procedure of Gopalakrishnan et al. In our experiments, we find that our estimation procedure leads to a small but promising gain in classification accuracy. Interestingly, the estimated language models also lead to an increase in the word error rate while improving the classification accuracy, showing that the system with the best classification accuracy is not necessarily the one with the lowest WER. Significantly, our LM estimation procedure does not require the correct transcription of the training data, and can therefore be applied to unsupervised learning from untranscribed speech data. © 2005 IEEE.
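The abstract's LM updates rely on the extended Baum-Welch procedure of Gopalakrishnan et al., which re-estimates a discrete probability distribution under an objective that is not a plain likelihood (here, conditional likelihood of the routed class). A minimal sketch of that update rule, assuming the gradients of the objective with respect to each probability are available (the function name `ebw_update` and the margin `epsilon` are illustrative, not from the paper):

```python
import numpy as np

def ebw_update(probs, grads, epsilon=1e-3):
    """One extended Baum-Welch step for a discrete distribution.

    probs : current probabilities (sum to 1)
    grads : partial derivatives of the objective w.r.t. each probability
    The update is p_i <- p_i * (g_i + C) / sum_j p_j * (g_j + C),
    where the constant C is chosen large enough that every term
    g_i + C is positive, which keeps the new probabilities valid.
    """
    probs = np.asarray(probs, dtype=float)
    grads = np.asarray(grads, dtype=float)
    # Pick C so all numerator terms stay positive (plus a small margin).
    C = max(0.0, -grads.min()) + epsilon
    numer = probs * (grads + C)
    return numer / numer.sum()
```

In the call routing setting described above, `probs` would be the n-gram probabilities of the LM and `grads` the derivatives of the classification (conditional likelihood) objective; components whose gradients are most negative shrink, while the distribution remains properly normalized.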
Hagen Soltau, George Saon, et al.
IEEE Transactions on Audio, Speech, and Language Processing
Youssef Mroueh, Etienne Marcheret, et al.
AISTATS 2017
Sabine Deligne
ICSLP 2000