Negative chemical data boosts language models in reaction outcome prediction

Alessandra Toniato; Alain C. Vaucher; Teodoro Laino; Mara Graziani

doi:10.1126/sciadv.adt5578

Science Advances

Paper

13 Jun 2025

Abstract

Trial-and-error approaches in chemistry generate abundant unsuccessful experiments, yet the potential of these so-called negative results remains largely underutilized. Here, we demonstrate that information from negative chemical reactions can be leveraged to improve reactivity-prediction models, offering advantages in scenarios with a limited volume of successful data. We extend the tuning of language models with reinforcement learning to the chemistry domain, training a transformer model for chemical reaction prediction. Our approach is evaluated using both a rigorously controlled dataset and a realistic high-throughput dataset comprising extensive reaction screenings across diverse catalyst sets and experimental conditions. The model achieves state-of-the-art performance by leveraging information from as few as 20 positive data points in the controlled dataset, supported by a negative dataset at least 40 times larger. Consistent results on both datasets demonstrate that, with an appropriate optimization strategy and the inclusion of unsuccessful experimental data, models can be effectively trained even when successful reactions are underrepresented.
