Erik Altman, Jovan Blanusa, et al.
NeurIPS 2023
Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query’s computational requirements and each engine’s capabilities, a complex, manual task that only becomes more difficult as new engines and workloads emerge. In this paper, we address this limitation by proposing a cross-engine optimizer that automates engine selection for diverse SQL queries by means of a learned cost model. A query plan, optimized with hints, is used for query cost prediction and routing. Cost prediction is formulated as a multi-task learning problem, and the model architecture uses multiple predictor heads, corresponding to different engines and provisionings. This eliminates the need to train engine-specific models and allows new engines to be added flexibly at minimal fine-tuning cost.
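The multi-head formulation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class `MultiHeadCostModel`, the engine names, the feature vector, and the randomly initialized (untrained) weights are all hypothetical, chosen only to show how a shared encoder with one predictor head per engine could support routing and the later addition of a new engine.

```python
import math
import random


class MultiHeadCostModel:
    """Hypothetical sketch: a shared encoder plus one linear head per engine."""

    def __init__(self, n_features, hidden=4, seed=0):
        self.rng = random.Random(seed)
        self.hidden = hidden
        # Shared encoder weights (hidden x n_features), reused by every head.
        self.W = [
            [self.rng.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(hidden)
        ]
        self.heads = {}  # engine name -> (weights, bias)

    def add_engine(self, name):
        # A new engine only adds a small head; the shared encoder is untouched.
        weights = [self.rng.uniform(-1, 1) for _ in range(self.hidden)]
        self.heads[name] = (weights, 0.0)

    def encode(self, x):
        # Shared representation of the (assumed) query-plan feature vector.
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.W]

    def predict(self, x):
        # One predicted cost per registered engine.
        h = self.encode(x)
        return {
            name: sum(w * v for w, v in zip(ws, h)) + b
            for name, (ws, b) in self.heads.items()
        }

    def route(self, x):
        # Route the query to the engine with the lowest predicted cost.
        costs = self.predict(x)
        return min(costs, key=costs.get)


# Usage with made-up plan features and engine names.
model = MultiHeadCostModel(n_features=3)
model.add_engine("engine_a")
model.add_engine("engine_b")
plan_features = [0.5, 1.0, -0.2]
predicted = model.predict(plan_features)
chosen = model.route(plan_features)
```

In this sketch, registering another engine with `add_engine` leaves the existing heads and their predictions unchanged, which mirrors the abstract's point that new engines can be added without training a separate engine-specific model. In a trained system, only the new head (and possibly the encoder) would be fine-tuned.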