Paper

HTA - An open-source software for assigning head and tail positions to monomer SMILES in polymerization reactions

Abstract

Abstract: Artificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer’s properties, the performance of AI prediction models will, ultimately, be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical data bases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups based on the atomic index of nucleophilicity. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm correctly predicted the polymer class of 204 monomer SMILES, achieving an accuracy of 99%. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of 91%. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign. Scientific contribution: The algorithm was successfully applied to data pre-processing by tagging the linkage bonds in monomers for defining the repeat units in polymerization reactions.