Conference paper

Robust Evaluation of LLM-Generated GraphQL Queries for Web Services

Abstract

GraphQL offers a flexible alternative to REST APIs, enabling precise data retrieval across multiple services, a critical requirement in today's service-oriented architectures. However, constructing complex GraphQL queries remains challenging, and even Large Language Models (LLMs) often generate suboptimal queries due to limited schema awareness. Recent advances, such as specialized prompt engineering, schema-aware in-context learning, and dedicated datasets, aim to improve query generation. Yet evaluating the quality of generated queries remains difficult: GraphQL's inherent flexibility allows semantically equivalent queries to differ syntactically, complicating both automatic and manual evaluation. In this work, we introduce Robust GraphQL Evaluation (RGEval), the first benchmarking pipeline designed to systematically assess the quality of LLM-generated GraphQL queries. RGEval handles schema complexities and structural variations, ensuring accurate evaluations while significantly improving efficiency, reducing evaluation time from hours to minutes. With Gartner projecting that over 60% of enterprises will adopt GraphQL in production by 2027, RGEval provides a critical solution for benchmarking LLM-generated queries, fostering trust in AI-driven web service consumption.
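To make the evaluation challenge concrete, the following toy Python sketch (our illustration, not part of the RGEval pipeline) shows two semantically equivalent GraphQL queries that a naive string comparison rejects. The queries and the `normalize` routine are hypothetical; a real evaluator would parse against the schema and compare ASTs.

```python
# Two GraphQL queries that request the same data but differ
# syntactically (field order, whitespace, layout). Naive string
# comparison is therefore an unreliable evaluation signal.

query_a = """
query { user(id: 1) { name email } }
"""

query_b = """
query {
  user(id: 1) {
    email
    name
  }
}
"""

def normalize(query: str) -> list[str]:
    """Crude canonical form: pad braces, tokenize, sort the tokens.

    This toy version only handles flat field reordering; a robust
    comparison needs schema-aware AST matching.
    """
    tokens = query.replace("{", " { ").replace("}", " } ").split()
    return sorted(tokens)

# Exact string match fails even though the queries are equivalent:
assert query_a.strip() != query_b.strip()

# The normalized forms agree:
assert normalize(query_a) == normalize(query_b)
```

Even this crude normalization rescues simple field-reordering cases; the gap between it and full semantic equivalence is exactly what a robust evaluation pipeline must close.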