Guojing Cong, David A. Bader
Journal of Parallel and Distributed Computing
Transformers have proven successful in a range of sequence modelling tasks. However, these models have significant limitations: they are inherently data-greedy and carry the risk of training data leakage. These limitations prevent their broad application in many domains. While the advent of foundation models (FMs) addresses the data-greedy nature of Transformers, the risk of exposing training data remains; it has been demonstrated that excerpts of the training data can be extracted from an FM by prompt engineering. To address both limitations simultaneously, we propose unified lookup tables (ULTs), a data preprocessing step that enables building and fine-tuning FMs on encoded data. ULTs enable the reuse of a trained model on new datasets without exposing any unencoded training data. The method leverages data compression methods as efficient modality tokenizers, together with a common representation vocabulary that facilitates fine-tuning on encoded data. We support our claims theoretically, through numerical estimates of the likelihood of reverse-engineering the data encoding, and empirically, through evaluation on domains that can benefit from ULTs. Specifically, we evaluate the impact of using ULTs as a preprocessing step before training both decoder-only and encoder–decoder language models on text, images, and molecules. We demonstrate that the encoding step does not negatively affect model training and leads to an average relative increase of ∼16% on a collection of text metrics, while producing near-competitive results on image classification and chemical reaction prediction tasks.
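The abstract describes ULTs as pairing a data compressor (acting as the modality tokenizer) with a common representation vocabulary, so that models are trained only on encoded tokens. The following is a minimal sketch of that idea, assuming zlib as the compressor and a random byte permutation as the lookup table; both choices, and all function names, are illustrative assumptions rather than the paper's exact construction.

```python
# Hypothetical sketch of a ULT-style preprocessing step: compress raw bytes
# from any modality, then remap the compressed symbols into a shared token
# vocabulary through a private lookup table. zlib and the random permutation
# are illustrative stand-ins, not the method described in the paper.
import random
import zlib


def build_lookup_table(seed: int, vocab_size: int = 256) -> list[int]:
    """Build a private permutation of the compressed-symbol alphabet."""
    table = list(range(vocab_size))
    random.Random(seed).shuffle(table)
    return table


def encode(raw: bytes, table: list[int]) -> list[int]:
    """Compress raw modality bytes, then remap each symbol via the ULT."""
    compressed = zlib.compress(raw)        # compressor acts as the tokenizer
    return [table[b] for b in compressed]  # token IDs in the shared vocabulary


if __name__ == "__main__":
    ult = build_lookup_table(seed=42)
    text_tokens = encode("an example sentence".encode("utf-8"), ult)
    image_tokens = encode(bytes(range(64)) * 4, ult)  # stand-in for pixel data
    print(len(text_tokens), len(image_tokens))  # both live in one vocabulary
```

Because every modality passes through the same compressed-symbol vocabulary, a model trained on one encoded dataset can, under this sketch's assumptions, be fine-tuned on another encoded dataset without the unencoded data ever being exposed.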
John R. Kender, Rick Kjeldsen
IEEE Transactions on Pattern Analysis and Machine Intelligence
Giuseppe Romano, Aakrati Jain, et al.
ECTC 2025
Merve Unuvar, Yurdaer Doganata, et al.
CLOUD 2014