Workshop paper

A General Pipeline for Fine-Tuning LLMs on Scientific Tabular Data

Abstract

Large Language Models (LLMs) excel at linguistic reasoning but remain limited in processing structured scientific data such as tables and measurement datasets. We introduce a scalable framework that converts tabular scientific data into validated natural-language question-answer (Q&A) corpora for LLM fine-tuning. The pipeline integrates statistical quantization, automated Q&A generation, linguistic refinement, and LLM-as-a-judge evaluation to ensure factual accuracy and linguistic quality. Applied to the QM9, QMOF, and PubChem datasets, it produced over 1.3 billion tokens across 12.5 million samples with high fluency and grammatical accuracy. This data-to-text paradigm bridges numerical and linguistic modalities, enabling LLMs to reason over empirical data and advancing the development of scientifically grounded, multimodal language models. All resulting corpora will be open-sourced.