Machine-learning (ML) workloads rely heavily on tensor kernels to implement high-performance linear-algebra operations. Each tensor kernel consists of two orthogonal components: the computation, which specifies the mathematical operation, and the schedule, which determines how that operation is executed on hardware. Achieving high performance depends critically on selecting an efficient schedule. Existing domain-specific languages (DSLs) for tensor programming adopt differing design philosophies: Halide decouples computation from scheduling, enabling flexibility but requiring expert tuning, whereas Triton tightly integrates computation and schedule, relying on compiler optimizations to generate efficient code.
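As a concrete illustration (not an example drawn from the paper), the minimal Triton vector-add kernel below shows how scheduling decisions, such as the tile size BLOCK_SIZE and the launch grid, are interleaved with the arithmetic itself in Triton's programming model.

    # Minimal Triton vector-add kernel; illustration only, not code from the paper.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                       # which tile this program instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                       # guard the ragged last tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)     # the computation (x + y) sits next to the tiling logic

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)                    # launch grid: another scheduling decision made inline
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

Changing the tiling strategy here means editing the same code that expresses the arithmetic, which is the coupling of computation and schedule described above.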
This work introduces Decoupled Triton (DT), a new DSL for writing GPU tensor kernels that combines the strengths of both approaches. DT provides a decoupled programming model similar to Halide's, while targeting the Triton compilation ecosystem to leverage its backend performance optimizations. DT acts as a high-level abstraction layer above Triton, separating algorithm specification from scheduling decisions without sacrificing performance portability. We present the design of the DT language, its scheduling model, and a prototype compiler implementation. Experimental results demonstrate that DT achieves performance competitive with hand-written Triton and PyTorch implementations while significantly improving programmability and modularity. DT enables systematic exploration of scheduling strategies and provides a foundation for more productive and maintainable tensor-kernel development.
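The abstract does not show DT syntax; the sketch below is a purely hypothetical illustration of what separating a computation from its schedule could look like, with every name (vector_add, Schedule, block_size, num_warps) invented for illustration rather than taken from DT.

    # Hypothetical sketch only: these names are invented for illustration and are
    # not the actual DT API, which is not shown in this abstract.
    def vector_add(x, y):
        # Computation: only the mathematical operation; no tiling, launch, or
        # parallelization decisions appear here.
        return x + y

    class Schedule:
        """Scheduling decisions kept separate from the computation."""
        def __init__(self, block_size: int, num_warps: int):
            self.block_size = block_size   # tile size handled by each program instance
            self.num_warps = num_warps     # parallelism / occupancy knob

    # The same computation can be paired with different schedules, so the schedule
    # space can be explored systematically without touching vector_add itself.
    schedules = [Schedule(block_size=bs, num_warps=w)
                 for bs in (256, 512, 1024) for w in (2, 4, 8)]

Under a decoupled model of this kind, scheduling choices become data that can be swept or autotuned independently of the algorithm specification, which is the kind of systematic exploration the abstract describes.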