Topic-based language models using Dirichlet mixtures
Abstract
We propose a generative text model that uses Dirichlet Mixtures as a prior distribution over the parameters of a multinomial distribution, whose compound distribution is a mixture of Polya (Dirichlet-multinomial) distributions, and show that the model performs well when applied to statistical language models. In this paper, we discuss methods for estimating the parameters of Dirichlet Mixtures and for estimating the expectations of the a posteriori distribution needed for adaptation, and then compare the model with two previous text models. The first is the Mixture of Unigrams, which is often used for incorporating topics into statistical language models. The second is LDA (Latent Dirichlet Allocation), a typical generative text model. In experiments on document probability and dynamic adaptation of n-gram models for newspaper articles, we show that the proposed model achieves lower perplexity than the two previous models at small numbers of mixture components. © 2007 Wiley Periodicals, Inc.
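For concreteness, the compound distribution referred to above, a mixture of Polya (Dirichlet-multinomial) distributions, can be written out as follows. The notation here (mixture weights $\lambda_k$, Dirichlet parameters $\alpha_{kv}$, word counts $n_v$) is ours, sketched from the standard definition of the Dirichlet-multinomial compound rather than taken from the paper:
\[
p(\mathbf{w}) \;=\; \sum_{k=1}^{K} \lambda_k \,
\frac{\Gamma\!\left(\sum_{v=1}^{V} \alpha_{kv}\right)}{\Gamma\!\left(\sum_{v=1}^{V} \alpha_{kv} + n\right)}
\prod_{v=1}^{V} \frac{\Gamma(\alpha_{kv} + n_v)}{\Gamma(\alpha_{kv})},
\qquad n = \sum_{v=1}^{V} n_v,
\]
where $n_v$ is the count of vocabulary word $v$ in document $\mathbf{w}$ and $V$ is the vocabulary size. Each term arises by integrating the multinomial likelihood of the word sequence against the $k$-th Dirichlet component, which is what makes the document probability computable in closed form without sampling the multinomial parameters.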