Nobuyasu Itoh, Gakuto Kurata, et al.
INTERSPEECH 2015
Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based, noise-robust VAD algorithm that uses long-term temporal information and harmonic-structure-based features of speech. Long-term temporal information has recently become a focus of ASR research, but it has not yet been investigated in depth for VAD. We first consider temporal features in the cepstral domain, calculated over the average phoneme duration. Harmonic structures, in contrast, are well-known carriers of acoustic information in human voices, but that information is difficult to exploit statistically. This paper also describes a new method for exploiting harmonic structure information with statistical models, which provides additional noise robustness. The proposed method, which combines the long-term temporal features with the static harmonic features, led to considerable improvements under low-SNR conditions: an average error reduction of 77.7% relative to the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system. © 2010 IEEE.
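The abstract does not spell out how the long-term temporal features are computed. As a rough illustration only, one common way to capture temporal dynamics in the cepstral domain is a regression-slope (delta) feature taken over a window of neighbouring frames. The sketch below assumes that approach; the function name `long_term_delta`, the 10 ms frame shift, and the 80 ms context length are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List


def long_term_delta(cepstra: List[List[float]], half_window: int) -> List[List[float]]:
    """Regression-slope (delta) features over 2 * half_window + 1 frames.

    Frames beyond either end of the utterance are replaced by the nearest
    edge frame. This is a generic delta formula, assumed for illustration.
    """
    if half_window < 1:
        raise ValueError("half_window must be at least 1")
    n_frames = len(cepstra)
    if n_frames == 0:
        return []
    n_coeffs = len(cepstra[0])
    # Normaliser of the least-squares slope: 2 * sum(k^2) for k = 1..half_window.
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for c in range(n_coeffs):
            acc = 0.0
            for k in range(1, half_window + 1):
                ahead = cepstra[min(t + k, n_frames - 1)][c]
                behind = cepstra[max(t - k, 0)][c]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Hypothetical framing: 10 ms frame shift and a context of about 80 ms.
FRAME_SHIFT_MS = 10
CONTEXT_MS = 80
HALF_WINDOW = CONTEXT_MS // (2 * FRAME_SHIFT_MS)  # 4 frames on each side

# Toy input: a single cepstral coefficient rising by 0.5 per frame.
ramp = [[0.5 * t] for t in range(20)]
features = long_term_delta(ramp, HALF_WINDOW)
```

For the toy ramp, frames far enough from the edges recover the per-frame slope of the input, while frames near the edges are damped by the edge padding. A wider window smooths over frame-level noise at the cost of temporal resolution, which is the usual trade-off when the context is stretched toward phoneme-length spans.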
Takashi Fukuda, Ryuki Tachibana, et al.
ICASSP 2012
Tohru Nagano, Ryuki Tachibana, et al.
ICASSP 2008
Takashi Fukuda, Osamu Ichikawa, et al.
ICASSP 2010