Osamu Ichikawa, Takashi Fukuda, et al.
IEEE JSTSP
Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. We previously proposed a statistical-model-based VAD that uses long-term temporal information in speech and shows good robustness against noise in an automobile environment. To improve it further, this paper describes a new method that exploits harmonic structure information with statistical models. In our approach, local peaks considered to be harmonic structures are extracted without explicit pitch detection or voiced-unvoiced classification. The proposed method, which combines long-term temporal and static harmonic features, led to considerable improvements under low-SNR conditions in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system. ©2010 IEEE.
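The abstract does not spell out how the local peaks are extracted. As a rough illustration only, the sketch below marks strict local maxima in a single log-power spectrum frame, with no pitch estimate or voicing decision involved. The function names, the `min_prominence` parameter, and the synthetic cosine-shaped frame are all assumptions made for this example and are not taken from the paper.

```python
import math


def local_peak_mask(log_spectrum, min_prominence=0.0):
    """Return a 0/1 mask marking bins that are strict local maxima.

    Hypothetical helper: a bin counts as a peak if it exceeds both
    neighbours and rises above the higher neighbour by at least
    `min_prominence`. Edge bins are never marked.
    """
    mask = [0] * len(log_spectrum)
    for k in range(1, len(log_spectrum) - 1):
        left, mid, right = log_spectrum[k - 1], log_spectrum[k], log_spectrum[k + 1]
        if mid > left and mid > right and mid - max(left, right) >= min_prominence:
            mask[k] = 1
    return mask


def synthetic_harmonic_frame(num_bins=64, spacing=8):
    """Toy log-power spectrum with one peak every `spacing` bins (illustration only)."""
    return [math.cos(2.0 * math.pi * k / spacing) for k in range(num_bins)]


frame = synthetic_harmonic_frame()
mask = local_peak_mask(frame)
peak_bins = [k for k, m in enumerate(mask) if m]
print(peak_bins)
```

A binary mask of this kind could serve as a static per-frame feature alongside temporal features, but how the paper actually turns the peaks into features for its statistical models is not stated in the abstract.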
Takashi Fukuda, Samuel Thomas
INTERSPEECH 2021
Takashi Fukuda, Samuel Thomas
INTERSPEECH 2020
Tara N. Sainath, Avishy Carmi, et al.
ICASSP 2010