Extreme Value Monte Carlo Tree Search for Classical Planning
Masataro Asai, Stephen Wissow
AAAI 2026
Large Language Models (LLMs) excel in linguistic reasoning but remain limited in processing structured scientific data such as tables and measurement datasets. We introduce a scalable framework that converts tabular scientific data into validated natural-language question-answer (Q&A) corpora for LLM fine-tuning. The pipeline integrates statistical quantization, automated Q&A generation, linguistic refinement, and LLM-as-a-judge evaluation to ensure factual and linguistic quality. Applied to the QM9, QMOF, and PubChem datasets, it produced over 1.3 billion tokens across 12.5 million samples with high fluency and grammatical accuracy. This data-to-text paradigm bridges numerical and linguistic modalities, enabling LLMs to reason over empirical data and advancing the development of scientifically grounded, multimodal language models. All resulting corpora will be open-sourced.
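The abstract describes the pipeline only at a high level. As a purely illustrative sketch (not the authors' released code), the following Python snippet shows one way the data-to-text step could look: a record from a QM9-style property table is quantized against the dataset distribution and rendered as a natural-language Q&A pair. The records, the quantize and to_qa helpers, and the binning thresholds are all hypothetical assumptions for illustration.

# Illustrative sketch only: converting one tabular record into a Q&A pair,
# with a simple statistical quantization step that maps a raw value onto a
# coarse verbal scale. All names and values below are hypothetical.
import statistics

# Hypothetical molecular records in the spirit of a QM9-style property table.
RECORDS = [
    {"name": "methane", "homo_lumo_gap_ev": 10.8},
    {"name": "benzene", "homo_lumo_gap_ev": 6.5},
    {"name": "anthracene", "homo_lumo_gap_ev": 3.3},
]

def quantize(value, values, labels=("low", "moderate", "high")):
    """Bin a value into a verbal label relative to the dataset distribution."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if value < mean - 0.5 * stdev:
        return labels[0]
    if value > mean + 0.5 * stdev:
        return labels[2]
    return labels[1]

def to_qa(record, all_values):
    """Render one record as a question-answer pair."""
    label = quantize(record["homo_lumo_gap_ev"], all_values)
    question = f"What is the HOMO-LUMO gap of {record['name']}?"
    answer = (
        f"The HOMO-LUMO gap of {record['name']} is "
        f"{record['homo_lumo_gap_ev']:.1f} eV, which is {label} "
        f"relative to the dataset."
    )
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    gaps = [r["homo_lumo_gap_ev"] for r in RECORDS]
    for r in RECORDS:
        qa = to_qa(r, gaps)
        print(qa["question"])
        print(qa["answer"])

Here quantization is a simple z-score binning into verbal labels; the paper's actual statistical quantization, Q&A generation templates, linguistic refinement, and LLM-as-a-judge filtering are not specified in this excerpt.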
Nathaniel Park, Tim Erdmann, et al.
Polycondensation 2024

Paula Olaya, Sophia Wen, et al.
Big Data 2024

Gang Liu, Michael Sun, et al.
ICLR 2025