A Dynamic Multi-Modal Fusion Model for Material Discovery
Abstract
Recent advances in Artificial Intelligence (AI) and Machine Learning (ML) have created vast opportunities in material discovery, with models trained on diverse data modalities such as SMILES, SELFIES, molecular graphs, spectra, and properties, spanning domains such as polymers, drugs, and crystals. Although these unimodal models effectively capture representations of their respective modalities or domains, models can gain a more comprehensive understanding of materials by drawing on representations learnt across modalities. Multimodal models learn to integrate and process information from diverse sources, enhancing robustness and providing deeper insights than unimodal models. By leveraging complementary information from each modality, multimodal models achieve greater representational power and can uncover patterns that remain hidden from unimodal models. Previous multimodal fusion methods often combined unimodal models through basic concatenation or other simple strategies, which rely on paired representations and struggle when data are scarce or modalities are missing. In this work, we propose a dynamic multimodal fusion model that efficiently combines unimodal representations, adapting the fusion process to the available inputs to produce a comprehensive representation. The core objective of our proposed model is to improve both the robustness and the performance of the multimodal model by adaptively tailoring the fusion process to the inputs from distinct unimodal models. The key benefits of our approach are: 1. Dynamic selection: the model dynamically selects the unimodal inputs that are most likely to improve the fused representation, effectively filtering out noisy or less informative modalities. 2. Handling missing modalities: the method manages scenarios in which paired data for different modalities are scarce or unavailable. To illustrate the method, we demonstrate its efficacy in combining three modalities, namely SMILES, SELFIES, and molecular graphs, and benchmark its performance against conventional fusion techniques such as simple concatenation. Our results show that the representation produced by the proposed dynamic fusion strategy significantly outperforms traditional fusion methods on a range of downstream prediction tasks. This work presents a flexible and general approach to combining representations from multiple modalities, paving the way for a deeper understanding of materials and their properties.
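The abstract describes two capabilities of the fusion layer: dynamic selection over unimodal inputs and tolerance to missing modalities. The sketch below is a minimal illustration of how such a layer could be realized with a learned gate that masks and renormalizes modality weights. It is not the paper's actual architecture; the class name GatedDynamicFusion, the embedding dimensions, and the PyTorch implementation are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedDynamicFusion(nn.Module):
    """Illustrative gated fusion of unimodal embeddings (hypothetical, not the paper's exact method).

    Each modality embedding is projected into a shared space; a learned gate
    scores each modality, and the scores are masked and renormalized so that
    missing modalities contribute nothing to the fused representation.
    """

    def __init__(self, input_dims, fused_dim=256):
        super().__init__()
        self.projections = nn.ModuleList(nn.Linear(d, fused_dim) for d in input_dims)
        self.gates = nn.ModuleList(nn.Linear(d, 1) for d in input_dims)

    def forward(self, embeddings, mask):
        # embeddings: list of (batch, d_i) tensors, one per modality.
        # mask: (batch, num_modalities), 1.0 where a modality is present, 0.0 where missing.
        projected = torch.stack(
            [proj(x) for proj, x in zip(self.projections, embeddings)], dim=1
        )  # (batch, num_modalities, fused_dim)
        scores = torch.cat(
            [gate(x) for gate, x in zip(self.gates, embeddings)], dim=1
        )  # (batch, num_modalities)
        # Mask out missing modalities before the softmax so they receive zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)  # (batch, num_modalities, 1)
        return (weights * projected).sum(dim=1)  # (batch, fused_dim)


# Hypothetical usage: SMILES, SELFIES, and molecular-graph embeddings of assumed
# widths, with the graph modality missing for the second sample in the batch.
smiles_emb = torch.randn(2, 384)
selfies_emb = torch.randn(2, 384)
graph_emb = torch.randn(2, 128)
mask = torch.tensor([[1.0, 1.0, 1.0],
                     [1.0, 1.0, 0.0]])

fusion = GatedDynamicFusion(input_dims=[384, 384, 128], fused_dim=256)
fused = fusion([smiles_emb, selfies_emb, graph_emb], mask)
print(fused.shape)  # torch.Size([2, 256])
```

Under these assumptions, a plain-concatenation baseline would simply stack the three embeddings along the feature dimension, which requires all modalities to be present; the gated variant instead reweights whatever subset is available, which is the behavior the abstract attributes to the proposed dynamic fusion.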