Text-to-SQL aims to translate natural language queries from users into SQL statements executable over a database, which is highly practical since it enables anyone to easily retrieve desired information from a database. Recently, many approaches have tackled this problem with Large Language Models (LLMs), leveraging their strong capabilities in understanding user queries and generating the corresponding SQL code. Yet, the parametric knowledge of LLMs may be insufficient to cover all the diverse, domain-specific queries that require grounding in various database schemas, which sometimes makes the generated SQL less accurate. To address this problem, we propose constructing a knowledge base for text-to-SQL, a foundational source of common knowledge, from which we retrieve and generate the knowledge necessary for diverse incoming queries. In this regard, our work differs from existing work that either manually annotates knowledge or generates only a few pieces of knowledge for each query. In particular, our knowledge base is comprehensive: it is constructed by prompting an LLM over all available existing questions and their associated database schemas to elicit relevant knowledge, and it can be effectively reused for unseen databases from different datasets. We experimentally validate our approach on benchmark text-to-SQL datasets under both overlapping and non-overlapping database scenarios, where it substantially outperforms relevant baselines.
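The abstract describes a retrieve-then-generate pipeline: knowledge is retrieved from the constructed knowledge base for a new question and schema, and SQL generation is then conditioned on it. The following minimal Python sketch illustrates one way such a pipeline could be wired up; the knowledge-base entry structure, the `embed` and `llm` callables, the cosine-similarity retrieval, and the prompt format are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed design, not the paper's implementation): retrieve
# knowledge-base entries relevant to a new question + schema, then condition
# SQL generation on them. `embed` and `llm` are hypothetical callables.

from dataclasses import dataclass
from typing import Callable, List
import numpy as np


@dataclass
class KnowledgeEntry:
    question: str        # question the entry was derived from
    schema: str          # serialized database schema
    knowledge: str       # LLM-elicited piece of knowledge (e.g., a domain hint)
    embedding: np.ndarray # embedding of question + schema


def retrieve(query: str, schema: str, kb: List[KnowledgeEntry],
             embed: Callable[[str], np.ndarray], top_k: int = 3) -> List[str]:
    """Return the top-k knowledge strings whose source question + schema
    is most similar (cosine similarity) to the incoming query + schema."""
    q = embed(query + "\n" + schema)
    q = q / np.linalg.norm(q)
    scores = [float(q @ (e.embedding / np.linalg.norm(e.embedding))) for e in kb]
    ranked = sorted(zip(scores, kb), key=lambda x: x[0], reverse=True)
    return [entry.knowledge for _, entry in ranked[:top_k]]


def generate_sql(query: str, schema: str, knowledge: List[str],
                 llm: Callable[[str], str]) -> str:
    """Prompt an LLM with the schema, the retrieved knowledge, and the question."""
    prompt = (
        f"Database schema:\n{schema}\n\n"
        "Relevant knowledge:\n" + "\n".join(f"- {k}" for k in knowledge) + "\n\n"
        f"Question: {query}\nSQL:"
    )
    return llm(prompt)
```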