Teng Xiao, Huaisheng Zhu, et al.
ICML 2024
Undesired toxicity is a major hindrance in drug discovery and largely responsible for high attrition rates in the early stages. This calls for new, reliable, and interpretable molecular property prediction models that help to prioritize compounds and thus reduce the high costs of development and the risk to humans, animals, and the environment. Here, we propose ToxSmi, an interpretable chemical language model that combines self-attention with multiscale convolutions and relies on data augmentation. We first benchmark various molecular representations (e.g., fingerprints, different flavors of SMILES and SELFIES, as well as graph and graph kernel methods), revealing that SMILES coupled with augmentation yields the best overall performance. Despite its simplicity, ToxSmi is then shown to outperform existing approaches across a wide range of molecular property prediction tasks, including but not limited to toxicity. Moreover, the attention weights of ToxSmi allow for easy interpretation and show enrichment of known toxicophores even without explicit supervision. To introduce a notion of model reliability, we propose and combine two simple methods for uncertainty estimation (Monte-Carlo dropout and test-time augmentation). These methods not only identify samples with high prediction uncertainty, but also allow forming implicit model ensembles that improve accuracy. Last, we validate ToxSmi on a large-scale proprietary toxicity dataset and find that it outperforms previous work while giving similar insights into cytotoxic substructures.
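The combination of Monte-Carlo dropout and test-time augmentation mentioned in the abstract can be illustrated with a minimal sketch. It is not the ToxSmi implementation: `predict_with_uncertainty`, `toy_augment`, and `toy_predict` are hypothetical names, the "model" is a noisy stand-in for a network evaluated with dropout left active, and the augmentation simply draws from a hand-written table of equivalent SMILES strings for ethanol. Only the aggregation step (mean as the implicit-ensemble prediction, spread as the uncertainty) is meant to reflect the idea.

```python
import random
import statistics
from typing import Callable, List, Tuple


def predict_with_uncertainty(
    predict_fn: Callable[[str], float],
    augment_fn: Callable[[str], str],
    smiles: str,
    n_samples: int = 20,
) -> Tuple[float, float]:
    """Return (mean, standard deviation) over repeated stochastic predictions.

    Each iteration draws a new input variant (test-time augmentation) and runs
    one stochastic forward pass (assumed: dropout stays active in predict_fn).
    """
    preds: List[float] = []
    for _ in range(n_samples):
        variant = augment_fn(smiles)
        preds.append(predict_fn(variant))
    # The mean acts as an implicit ensemble; the spread flags uncertain samples.
    return statistics.fmean(preds), statistics.pstdev(preds)


# Hypothetical stand-ins for illustration only.
EQUIVALENT_SMILES = {"CCO": ["CCO", "OCC", "C(O)C"]}  # three ways to write ethanol


def toy_augment(smiles: str) -> str:
    """Pick one of the hand-listed equivalent SMILES strings, if any are known."""
    return random.choice(EQUIVALENT_SMILES.get(smiles, [smiles]))


def toy_predict(smiles: str) -> float:
    """Fake toxicity score: depends on the string form, plus dropout-like noise."""
    base = 0.3 + 0.05 * smiles.count("(")
    return min(1.0, max(0.0, base + random.gauss(0.0, 0.05)))


if __name__ == "__main__":
    random.seed(0)
    mean, std = predict_with_uncertainty(toy_predict, toy_augment, "CCO", n_samples=50)
    print(f"prediction={mean:.3f}, uncertainty={std:.3f}")
```

In a real setting, `augment_fn` would typically come from a cheminformatics toolkit that can enumerate randomized SMILES, and `predict_fn` would wrap a trained network with dropout kept on at inference time.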
Alain Vaucher, Philippe Schwaller, et al.
AMLD EPFL 2022
Conrad Albrecht, Jannik Schneider, et al.
CVPR 2025
Bo Zhao, Nima Dehmamy, et al.
ICML 2025