Poster

Triton vs. Halide: Exploring Coupled and Decoupled Machine Learning Kernel Languages

Abstract

Machine-learning (ML) workloads rely heavily on tensor kernels to implement high-performance linear-algebra operations. Each tensor kernel consists of two orthogonal components: the computation, which specifies the mathematical operation, and the schedule, which determines how that operation is executed on hardware. Achieving high performance depends critically on selecting an efficient schedule. Existing domain-specific languages (DSLs) for tensor programming adopt differing design philosophies: Halide decouples computation from scheduling, enabling flexibility but requiring expert tuning, whereas Triton tightly integrates the computation with its schedule, relying on compiler optimizations to generate efficient code.
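
To make the coupled style concrete, below is a minimal vector-add kernel in ordinary Triton (closely following Triton's introductory tutorial; the launch wrapper and the 1024-element block size are illustrative choices). The tiling, indexing, and masking decisions are written inline with the arithmetic itself, which is the coupling that a decoupled design aims to factor out.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Scheduling concerns (tiling, indexing, masking) sit next to the math.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Host-side launch: the grid size is another schedule-like decision.
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out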

This work introduces Decoupled Triton (DT), a new DSL for writing GPU tensor kernels that combines the strengths of both approaches. DT provides a decoupled programming model similar to Halide's while targeting the Triton compilation ecosystem to leverage its backend performance optimizations. DT acts as a high-level abstraction layer above Triton, separating algorithm specification from scheduling decisions without sacrificing performance portability. We present the design of the DT language, its scheduling model, and a prototype compiler implementation. Experimental results demonstrate that DT achieves performance competitive with hand-written Triton and PyTorch implementations while significantly improving programmability and modularity. DT enables systematic exploration of scheduling strategies and provides a foundation for more productive and maintainable tensor-kernel development.
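
As a purely hypothetical illustration of the decoupled model (the names below are invented for this sketch and are not the actual DT syntax, which is described in the poster itself), the same vector addition could be expressed as an algorithm plus a family of interchangeable schedules:

    # Hypothetical sketch only: "lower" and the schedule dictionaries are
    # illustrative stand-ins, not the real DT API.

    # Algorithm: the mathematical operation, free of execution decisions.
    def vec_add(x, y):
        return x + y

    # Schedules: execution strategies that can be swapped without touching the
    # algorithm (the knobs correspond to Triton-level launch parameters).
    candidate_schedules = [
        {"block_size": 256,  "num_warps": 4},
        {"block_size": 1024, "num_warps": 8},
    ]

    # A decoupled compiler would pair the algorithm with each schedule, lower
    # the pair to a Triton kernel, and keep the fastest variant, e.g.:
    #   kernels = [lower(vec_add, s) for s in candidate_schedules]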