Workshop paper

Towards Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models

Abstract

Molecular Property Prediction (MPP) is a central task in the drug discovery pipeline and has recently attracted significant attention. Large Language Models (LLMs), known for their impressive proficiency across domains, show promise as generalist models for MPP. However, their current performance remains below the threshold needed for practical adoption. To bridge this gap, we propose a novel method for boosting LLMs by distilling knowledge from tree-based specialist models, thereby complementing the LLMs' internal knowledge and improving predictive accuracy. Specifically, we first train Random Forest specialist models, each composed of many decision trees, on features derived from 40K functional groups extracted from the input molecules. We then extract and verbalize the predictive rules learned by these decision trees and randomly select one rule to incorporate into each prompt when training the LLMs. Lastly, we introduce rule-consistency, a test-time scaling technique that further boosts LLMs and proves more effective than standard self-consistency. Extensive experiments on 9 datasets from the TDC benchmark with Gemma-2-2B and Granite-3.3-2B show that our method substantially enhances LLM performance on MPP, marking progress toward generalist models for this task.
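
To make the rule-distillation step concrete, below is a minimal sketch of how a predictive rule could be extracted from one decision tree of a trained Random Forest and verbalized for inclusion in a prompt. It assumes scikit-learn-style trees and binary presence/absence features over hypothetical functional-group names (hydroxyl, carbonyl, etc.); the toy data, feature names, and phrasing are illustrative assumptions, not the paper's exact pipeline.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary features: presence/absence of functional groups (hypothetical names).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = (X[:, 0] & ~X[:, 2]).astype(int)  # synthetic label, for illustration only
feature_names = ["hydroxyl", "carbonyl", "amine", "nitro", "halogen"]

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

def verbalize_rule(tree, feature_names, rng):
    """Walk one random root-to-leaf path and phrase it as a natural-language rule."""
    t = tree.tree_
    node, conditions = 0, []
    while t.children_left[node] != -1:  # -1 marks a leaf node
        name = feature_names[t.feature[node]]
        go_left = rng.random() < 0.5
        # With 0/1 features the split threshold is 0.5, so the left branch
        # means the group is absent and the right branch means it is present.
        conditions.append(f"the molecule lacks a {name} group" if go_left
                          else f"the molecule has a {name} group")
        node = t.children_left[node] if go_left else t.children_right[node]
    counts = t.value[node][0]  # per-class counts at the leaf
    label = "active" if counts.argmax() == 1 else "inactive"
    premise = " and ".join(conditions) if conditions else "no condition applies"
    return f"If {premise}, then it is likely {label}."

# Sample one tree from the forest and verbalize one of its rules for the prompt.
tree = forest.estimators_[rng.integers(len(forest.estimators_))]
print(verbalize_rule(tree, feature_names, rng))

At training time, one such verbalized rule would be sampled per prompt; at test time, rule-consistency could aggregate answers across prompts built from different sampled rules, analogous to self-consistency over sampled reasoning paths.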