A Large Encoder-Decoder Polymer-Based Foundation Model
Abstract
Polymer representation is a persistent challenge for deep-learning models of polymer property prediction: a useful representation must balance structural accuracy with interoperability. To address this, we introduce a serialized polymer graph (SPG) notation and SPG-TED289M, an SPG-based foundation model for polymers pre-trained on a carefully curated dataset of 1 million SPG samples. To better handle the unique characteristics of SPG, we extended the tokenization process, resulting in a vocabulary of 2,407 distinct tokens. We evaluated SPG-TED289M across a range of tasks, including copolymer phase behavior, polymer membrane properties, multi-task learning, refractive index prediction, ionic conductivity, gas permeability, and glass transition temperature. The model demonstrated state-of-the-art performance in most of these areas, achieving results on par with specialized models designed for specific tasks. This indicates that SPG-TED289M, with minimal fine-tuning, can adapt effectively to complex polymer-related tasks, showcasing its robustness and versatility as a foundation model. Its flexibility and scalability make it a valuable tool for a variety of applications in polymer science.