Debasis Ganguly, Martin Gleize, et al.
AMIA Annual Symposium Proceedings
A standard word embedding algorithm, such as 'word2vec', embeds each word as a dense vector of a preset dimensionality, whose components are learned by maximizing the likelihood of predicting the context around that word. However, it is an inherent linguistic phenomenon that words differ in how difficult they are to identify from their contexts. This suggests that a variable granularity in word vector representation may yield sparser, more compressed word representations that require less storage space. To that end, in this paper we propose a word vector training algorithm that uses a variable number of components to represent each word. Given a collection of documents, our algorithm, like the skip-gram approach of word2vec, learns to predict the context of a word from its current instance. However, in contrast to skip-gram, which uses a fixed number of dimensions for every word vector, we dynamically increase the dimensionality as a stochastic function of the prediction error. Our experiments on standard test collections demonstrate that our word representations achieve effectiveness comparable to (and sometimes better than) skip-gram word2vec while using a significantly smaller number of parameters (a compression ratio of around 65%).
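The abstract describes the growth rule only at a high level, so the following is a minimal illustrative sketch (in Python with NumPy) of the general idea: a skip-gram-style trainer in which each word's vector can gain one dimension at a time, with a probability driven by the current prediction error. The concrete details here are assumptions for illustration, not the paper's method: the class name VariableDimSkipGram, the logistic loss with random negative sampling, the prefix alignment of unequal-length vectors, and the grow_scale mapping from error to growth probability are all invented for this sketch.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VariableDimSkipGram:
    """Toy skip-gram-style trainer whose per-word vectors grow one dimension at a
    time, with growth triggered stochastically by the prediction error.
    A sketch of the general idea only, not the paper's implementation."""

    def __init__(self, vocab, start_dim=4, max_dim=64, lr=0.05, grow_scale=0.5):
        self.lr = lr
        self.max_dim = max_dim
        self.grow_scale = grow_scale  # assumed mapping from error to growth probability
        # separate input ("target") and output ("context") vectors per word
        self.w_in = {w: rng.normal(0, 0.1, start_dim) for w in vocab}
        self.w_out = {w: rng.normal(0, 0.1, start_dim) for w in vocab}

    def _score(self, u, v):
        d = min(len(u), len(v))  # align unequal-length vectors on their shared prefix
        return float(u[:d] @ v[:d]), d

    def _update_pair(self, target, context, label):
        # one logistic-loss update for a (target, context) pair;
        # label is 1.0 for an observed pair and 0.0 for a negative sample
        u, v = self.w_in[target], self.w_out[context]
        score, d = self._score(u, v)
        err = label - sigmoid(score)
        du = self.lr * err * v[:d]
        dv = self.lr * err * u[:d]
        u[:d] += du
        v[:d] += dv
        return abs(err)

    def _maybe_grow(self, word, err):
        # stochastic growth: a larger prediction error makes it more likely that
        # the word's vector gains one extra, near-zero component
        if len(self.w_in[word]) < self.max_dim and rng.random() < self.grow_scale * err:
            self.w_in[word] = np.append(self.w_in[word], rng.normal(0, 0.01))

    def train(self, sentences, window=2, negatives=2, epochs=5):
        words = list(self.w_in)
        for _ in range(epochs):
            for sent in sentences:
                for i, target in enumerate(sent):
                    lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                    for j in range(lo, hi):
                        if j == i:
                            continue
                        err = self._update_pair(target, sent[j], 1.0)
                        for _ in range(negatives):
                            err += self._update_pair(target, rng.choice(words), 0.0)
                        self._maybe_grow(target, err / (1 + negatives))

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = VariableDimSkipGram({w for s in sentences for w in s})
model.train(sentences)
print({w: len(v) for w, v in model.w_in.items()})  # per-word dimensionality after training

Because vectors in this sketch only grow where the model keeps mispredicting, easily predicted words stay low-dimensional, which illustrates the source of the parameter savings the abstract reports.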