GraphQL Query Generation: A Large Training and Benchmarking Dataset
Abstract
GraphQL is a powerful query language for APIs, designed to let clients fetch exactly the data they need in an efficient and flexible manner. It allows multiple resources to be queried in a single request, seamlessly integrating data from various sources such as APIs and databases. However, writing GraphQL queries can be challenging, particularly for complex industry use cases. An attractive alternative is to use contemporary Large Language Models (LLMs) to generate GraphQL query operations from natural language queries. However, due to the limited availability of publicly accessible GraphQL schemas, these LLMs have not been sufficiently exposed to GraphQL data during training. Consequently, they often struggle to generate valid and optimized GraphQL query operations. To address this issue, we present a large-scale, complex, cross-domain, and cross-source text-to-GraphQL query operation dataset. The dataset includes 10,940 training triples from 185 cross-source data stores and 954 test triples from 14 cross-source data stores. Each triple consists of a GraphQL schema, a GraphQL query operation, and the corresponding natural language query. The dataset was manually generated (with natural-language paraphrasing) and then manually validated, with a total effort of around 1,200 person-hours. We evaluated 9 state-of-the-art LLMs on our test dataset. The models achieved only 0-10% accuracy in the zero-shot setting; with few-shot examples, some models improved to 30-40% accuracy. These results highlight the need for such a complex GraphQL dataset, which can later be used for model fine-tuning or prompt tuning. Our dataset will be publicly released under the MIT License.
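To make the triple structure described above concrete, a minimal sketch follows. The schema, field names, and natural-language question are invented for illustration and are not drawn from the dataset itself.

```python
# Hypothetical illustration of one dataset triple: a GraphQL schema,
# the corresponding natural-language query, and the target GraphQL
# query operation. All identifiers here are made up for illustration.
triple = {
    "schema": """
type Query { orders(status: String): [Order] }
type Order { id: ID! total: Float customer: Customer }
type Customer { name: String }
""",
    "nl_query": "List the IDs and customer names of all shipped orders.",
    "graphql_operation": """
query {
  orders(status: "shipped") {
    id
    customer { name }
  }
}
""",
}

# A text-to-GraphQL model receives the schema and the natural-language
# query as input and must produce the query operation as output.
assert set(triple) == {"schema", "nl_query", "graphql_operation"}
```

In this framing, evaluation checks whether a generated operation is both syntactically valid against the schema and semantically equivalent to the reference operation.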