Guojing Cong, David A. Bader
Journal of Parallel and Distributed Computing
Transformers have proven successful in a range of sequence modelling tasks. However, these models have significant limitations: they are inherently data-greedy and carry the risk of training data leakage. These limitations prevent their broad application in many domains. While the advent of foundation models (FMs) addresses the data-greedy nature of Transformers, the risk of exposing training data remains; it has been demonstrated that excerpts of the training data can be extracted from an FM by prompt engineering. To address both limitations simultaneously, we propose unified lookup tables (ULTs), a data preprocessing step that enables building and fine-tuning FMs on encoded data. ULTs enable the reuse of a trained model on new datasets without exposing any unencoded training data. The method leverages data compression methods as efficient modality tokenizers, together with a common representation vocabulary that facilitates fine-tuning on encoded data. We support our claims theoretically, through numerical estimates of the likelihood of reverse-engineering the data encoding, and empirically, through evaluation on domains that can benefit from ULTs. Specifically, we evaluate the impact of using ULTs as a preprocessing step before training both decoder-only and encoder–decoder language models on text, images, and molecules. We demonstrate that the encoding step does not negatively affect model training and leads to an average relative increase of ∼16% on a collection of text metrics, while producing near-competitive results on image classification and chemical reaction prediction tasks.
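The abstract describes ULTs as pairing a data compressor (acting as the modality tokenizer) with a common representation vocabulary, so that models are trained only on encoded tokens. The following is a minimal sketch of that idea, assuming zlib as the compressor and a random byte permutation as the lookup table; both choices, and all function names, are illustrative assumptions rather than the paper's exact construction.

```python
# Hypothetical sketch of a ULT-style preprocessing step: compress raw bytes
# from any modality, then remap the compressed symbols into a shared token
# vocabulary through a private lookup table. zlib and the random permutation
# are illustrative stand-ins, not the method described in the paper.
import random
import zlib


def build_lookup_table(seed: int, vocab_size: int = 256) -> list[int]:
    """Build a private permutation of the compressed-symbol alphabet."""
    table = list(range(vocab_size))
    random.Random(seed).shuffle(table)
    return table


def encode(raw: bytes, table: list[int]) -> list[int]:
    """Compress raw modality bytes, then remap each symbol via the ULT."""
    compressed = zlib.compress(raw)        # compressor acts as the tokenizer
    return [table[b] for b in compressed]  # token IDs in the shared vocabulary


if __name__ == "__main__":
    ult = build_lookup_table(seed=42)
    text_tokens = encode("an example sentence".encode("utf-8"), ult)
    image_tokens = encode(bytes(range(64)) * 4, ult)  # stand-in for pixel data
    print(len(text_tokens), len(image_tokens))  # both live in one vocabulary
```

Because every modality passes through the same compressed-symbol vocabulary, a model trained on one encoded dataset can, under this sketch's assumptions, be fine-tuned on another encoded dataset without the unencoded data ever being exposed.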
John R. Kender, Rick Kjeldsen
IEEE Transactions on Pattern Analysis and Machine Intelligence
Giuseppe Romano, Aakrati Jain, et al.
ECTC 2025
Merve Unuvar, Yurdaer Doganata, et al.
CLOUD 2014