Om D. Deshmukh, Shajith Ikbal, et al.
INTERSPEECH 2011
As the target of Automatic Speech Recognition (ASR) has moved from clean read speech to spontaneous conversational speech, we need to prepare orthographic transcripts of spontaneous conversational speech to train acoustic models (AMs). However, it is expensive and slow to manually transcribe such speech word by word. We propose a framework to train an AM based on easy-to-make rough transcripts in which fillers and small word fragments are not precisely transcribed and some transcription errors are included. By focusing on the phone duration in the result of forced alignment between the rough transcripts and the utterances, we can automatically detect the erroneous parts in the rough transcripts. A preliminary experiment showed that we can detect the erroneous parts with moderately high recall and precision. Through ASR experiments with conversational telephone speech, we confirmed that automatic detection helped improve the performance of the AM trained with both conventional ML criteria and state-of-the-art boosted MMI criteria. Copyright © 2011 ISCA.
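The duration-based detection described in the abstract might be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure. It assumes a forced aligner has already produced `(phone, start_sec, end_sec)` tuples. The function name `flag_suspect_phones` and the thresholds `min_dur` and `max_ratio` are hypothetical, and the per-phone median is a stand-in for whatever duration statistics the paper actually uses.

```python
from statistics import median


def flag_suspect_phones(alignment, min_dur=0.03, max_ratio=3.0):
    """Return indices of aligned phones whose duration looks implausible.

    alignment: list of (phone, start_sec, end_sec) tuples, assumed to come
    from a forced alignment of a rough transcript against the audio.
    min_dur and max_ratio are hypothetical thresholds, not values from the paper.
    """
    # Collect the observed durations of each phone label.
    durations = {}
    for phone, start, end in alignment:
        durations.setdefault(phone, []).append(end - start)

    # Use the median as a rough estimate of each phone's typical duration.
    typical = {phone: median(values) for phone, values in durations.items()}

    flagged = []
    for index, (phone, start, end) in enumerate(alignment):
        duration = end - start
        # A phone squeezed to almost nothing or stretched far beyond its
        # typical length suggests the transcript disagrees with the audio there.
        if duration < min_dur or duration > max_ratio * typical[phone]:
            flagged.append(index)
    return flagged
```

In practice the typical durations would more plausibly be estimated over a whole corpus rather than a single utterance, and the flagged regions would then be excluded or down-weighted when training the acoustic model.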
Christine Robson, Sean Kandel, et al.
CHI 2011
Vikram Gupta, Jitendra Ajmera, et al.
INTERSPEECH 2011
Christoph Tillmann, Sanjika Hewavitharana
INTERSPEECH 2011