Large Language Models have transformed cloud computing, but their deployment poses a challenging trilemma among operational cost, energy consumption, and performance requirements. This keynote presents a novel open architecture that harmonizes multiple efficiency techniques to address these competing concerns. We examine key optimization strategies, including quantization, batching, KV caching, auto-scaling, model parallelism, and specialized hardware accelerators, analyzing their individual strengths and the compounding benefits they yield when integrated as a cohesive system.
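As a minimal illustration of one of the techniques listed above, the sketch below shows symmetric per-tensor int8 quantization, storing weights in 8 bits plus a single float scale for roughly a 4x memory reduction over float32. This is a simplified example under stated assumptions, not the specific scheme discussed in the keynote; all function names here are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    # Hypothetical helper: map float32 weights to int8 with one shared scale.
    # The scale maps the largest-magnitude weight to the int8 extreme 127.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 2.4], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)

# Rounding error per weight is bounded by half the quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

In practice, production systems refine this idea with per-channel or per-group scales and activation-aware calibration, but the memory/accuracy trade governed by the scale factor is the same.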