Efficient, low latency adaptation for speech recognition
Abstract
Constrained or feature-space Maximum Likelihood Linear Regression (FMLLR) is known to be an effective algorithm for adaptation to a new speaker or environment. It employs a single transformation matrix and bias vector to linearly transform the test speaker's features. FMLLR makes no assumptions about the underlying noise, environment, or speaker, and estimates its parameters to maximize the likelihood of the test data. The standard implementation, however, demands considerable computation and storage, and requires a first-pass decoding before adaptation can begin. In this paper, we propose a simplified implementation of FMLLR for embedded applications that addresses these problems. We employ a simple speech/silence segmentation to estimate the parameters, and we operate in the 13-dimensional cepstral space, so resource requirements are low. The algorithm does not require a first-pass decoding (parameter estimation is accomplished entirely in the front end) and can be applied with much lower latency than standard FMLLR. The algorithms described here offer an attractive tradeoff between the power of FMLLR and the computational simplicity of Cepstral Mean Subtraction. At minimal cost, we achieve nearly 15% relative gains on an embedded speech recognition task.
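For reference, a minimal sketch of the two transforms being traded off, in notation assumed here rather than taken from the paper: FMLLR applies a single affine transform to each feature vector x_t,

    \hat{x}_t = A x_t + b,

with A and b estimated to maximize the likelihood of the adaptation data under the acoustic model, whereas Cepstral Mean Subtraction only removes the cepstral mean over an utterance or speaker,

    \hat{c}_t = c_t - \frac{1}{T} \sum_{\tau=1}^{T} c_\tau.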