Experiments in high-dimensional text categorization
Fred J. Damerau, Tong Zhang, et al.
SIGIR Forum (ACM Special Interest Group on Information Retrieval)
This paper presents a novel online relevant set algorithm for a linearly scored block sequence translation model. The key component is a new procedure to directly optimize the global scoring function used by a statistical machine translation (SMT) decoder. This training procedure treats the decoder as a black box, and thus can be used to optimize any decoding scheme. The algorithm is evaluated using two feature types: 1) commonly used probabilistic features, such as translation, language, or distortion model probabilities, and 2) binary features. In particular, encouraging results on a standard Arabic-English translation task are presented for a translation system that uses only binary feature functions. To further demonstrate the effectiveness of the training algorithm, a detailed comparison with the widely used minimum-error-rate (MER) training algorithm is presented using the same decoder and feature set. The online algorithm is simplified by introducing so-called "seed" block sequences, which enable the training to be carried out without a gold standard block translation. While the online training algorithm is extremely fast, it also improves translation scores over the MER algorithm in some experiments. © 2008 IEEE.
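As a rough illustration of the general idea of training a linear scoring function online while treating the decoder as a black box, the following sketch uses a plain structured-perceptron update. It is not the relevant set algorithm described in the abstract. The names `train_online`, `toy_decode`, `CANDIDATES`, and `EXAMPLES`, the feature names, and the toy data are all hypothetical and chosen only for illustration; the "seed" here is simply a target feature vector standing in for a seed block sequence.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]  # sparse feature vector: feature name -> value
Weights = Dict[str, float]


def score(weights: Weights, feats: Features) -> float:
    """Linear score: dot product of the weights with a sparse feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def perceptron_update(weights: Weights, target: Features,
                      predicted: Features, lr: float = 1.0) -> None:
    """Move the weights toward the target features and away from the predicted ones."""
    for name, value in target.items():
        weights[name] = weights.get(name, 0.0) + lr * value
    for name, value in predicted.items():
        weights[name] = weights.get(name, 0.0) - lr * value


def train_online(examples: List[Tuple[str, Features]],
                 decode: Callable[[str, Weights], Features],
                 epochs: int = 1, lr: float = 1.0) -> Weights:
    """Online training loop; `decode` is only ever called, never inspected."""
    weights: Weights = {}
    for _ in range(epochs):
        for source, seed_feats in examples:
            predicted_feats = decode(source, weights)
            if predicted_feats != seed_feats:
                perceptron_update(weights, seed_feats, predicted_feats, lr)
    return weights


# Hypothetical toy setup: each source has two candidate outputs,
# each described by a single binary feature.
CANDIDATES: Dict[str, List[Features]] = {
    "s1": [{"block:a": 1.0}, {"block:b": 1.0}],
    "s2": [{"block:c": 1.0}, {"block:b": 1.0}],
}


def toy_decode(source: str, weights: Weights) -> Features:
    """Stand-in decoder: return the highest-scoring candidate for the source."""
    return max(CANDIDATES[source], key=lambda feats: score(weights, feats))


EXAMPLES: List[Tuple[str, Features]] = [
    ("s1", {"block:b": 1.0}),
    ("s2", {"block:c": 1.0}),
]

weights = train_online(EXAMPLES, toy_decode, epochs=3)
```

Because the training loop only calls `decode` and compares its output with the seed, any decoder with the same call signature could be substituted, which is the property the abstract attributes to black-box training.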
Alina Beygelzimer, Daniel Hsu, et al.
NeurIPS 2010
Christoph Tillmann, Sanjika Hewavitharana
INTERSPEECH 2011
Christoph Tillmann, Tong Zhang
ACM Transactions on Speech and Language Processing