Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement
Abstract
We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) as used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were, on average, 43% higher than those of SVD. Our algorithm creates basis vectors for a reduced space by iteratively "scaling" vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms so that information is captured more evenly across documents. We also discuss the algorithm's correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.
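For context, the sketch below illustrates the standard SVD-based Latent Semantic Indexing baseline that the abstract compares against, not the authors' iterative-scaling algorithm: documents are projected into a k-dimensional latent space via truncated SVD of a term-document matrix, and inter-document similarity is then measured by cosine similarity in that space. The matrix contents and the choice of k here are illustrative assumptions only.

```python
# Minimal sketch of the SVD/LSI baseline (not the paper's iterative-scaling method).
# Assumes a term-document count matrix X with terms as rows and documents as columns.
import numpy as np

def lsi_document_vectors(X, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    # X ~= U_k * diag(s_k) * Vt_k; rows of (diag(s_k) @ Vt_k).T are document vectors.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T          # shape: (n_docs, k)

def cosine_similarities(D):
    """Pairwise inter-document cosine similarity matrix."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    Dn = D / np.clip(norms, 1e-12, None)
    return Dn @ Dn.T

# Toy term-document matrix: 5 terms x 4 documents (hypothetical counts).
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 3, 1],
              [1, 0, 1, 2]], dtype=float)

D = lsi_document_vectors(X, k=2)
print(cosine_similarities(D))
```

The paper's contribution replaces this symmetric SVD step with an iterative procedure that rescales vectors between eigenvector computations, as described above; the precision and dimensionality-reduction figures in the abstract refer to that comparison.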