Whole Sentence Neural Language Models
Abstract
Recurrent neural networks have become increasingly popular for the task of language modeling, achieving impressive gains in state-of-the-art speech recognition and natural language processing (NLP) tasks. Recurrent models exploit word dependencies over a much longer context window (retained in the history states) than is feasible with n-gram language models. However, the training criterion of choice for recurrent language models continues to be the local conditional likelihood of generating the current word given the (possibly long) word context, thus making local decisions at each word. This locally-conditional design fundamentally limits the model's ability to exploit whole-sentence structure. In this paper, we present our initial results on whole-sentence neural language models, which assign a probability to the entire word sequence. We extend previous work on whole-sentence maximum entropy models to recurrent language models, using Noise Contrastive Estimation (NCE) for training, as these sentence-level models are fundamentally unnormalizable. We present results on a range of tasks, from sequence identification tasks such as palindrome detection to large vocabulary continuous speech recognition (LVCSR), and demonstrate the modeling power of this approach.
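To make the training setup concrete, the sketch below shows one way NCE can be used to train an unnormalized whole-sentence RNN scorer. This is an illustrative example under stated assumptions, not the authors' implementation: the class and function names (SentenceScorer, sample_noise, nce_loss), the PyTorch framework, the unigram noise distribution, and all hyperparameters are assumptions made for the example. The model assigns an unnormalized log-score to a full word sequence, and a binary logistic loss discriminates data sentences from k noise sentences per data sentence.

```python
# A minimal sketch (not the paper's code): NCE training of an unnormalized
# whole-sentence RNN scorer in PyTorch. Names and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Assigns an unnormalized log-score s(w_1..w_T) to a whole sentence."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)  # scalar score from final state

    def forward(self, sentences):              # sentences: (batch, T) word ids
        h, _ = self.rnn(self.embed(sentences))
        return self.score(h[:, -1, :]).squeeze(-1)   # (batch,) log-scores

def sample_noise(n, length, unigram_probs):
    """Draw n noise sentences from a unigram noise distribution and return
    their log-probabilities under that distribution (needed by NCE)."""
    noise = torch.multinomial(unigram_probs, n * length, replacement=True)
    noise = noise.view(n, length)
    log_pn = unigram_probs[noise].log().sum(dim=1)    # independent words
    return noise, log_pn

def nce_loss(model, data, data_log_pn, noise, noise_log_pn, k):
    """Binary NCE objective: classify data vs. k noise sentences per data
    sentence, with logit = s(x) - log(k * p_noise(x))."""
    log_k = torch.log(torch.tensor(float(k)))
    data_logits = model(data) - (log_k + data_log_pn)
    noise_logits = model(noise) - (log_k + noise_log_pn)
    bce = nn.functional.binary_cross_entropy_with_logits
    return bce(data_logits, torch.ones_like(data_logits)) \
         + k * bce(noise_logits, torch.zeros_like(noise_logits))

# Toy usage: random "data" sentences, just to show one training step.
vocab_size, T, batch, k = 50, 10, 8, 4
unigram = torch.ones(vocab_size) / vocab_size         # uniform unigram noise
model = SentenceScorer(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randint(0, vocab_size, (batch, T))
data_log_pn = unigram[data].log().sum(dim=1)          # p_noise of data sentences
noise, noise_log_pn = sample_noise(batch * k, T, unigram)

opt.zero_grad()
loss = nce_loss(model, data, data_log_pn, noise, noise_log_pn, k)
loss.backward()
opt.step()
print(f"NCE loss: {loss.item():.4f}")
```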