Large Language Models have transformed cloud computing, but their deployment poses a challenging trilemma among operational cost, energy consumption, and performance requirements. This keynote presents a novel open architecture that harmonizes multiple efficiency techniques to address these competing concerns. We examine key optimization strategies, including quantization, batching, KV caching, auto-scaling, model parallelism, and specialized hardware accelerators, analyzing their individual strengths and the compounding benefits they yield when integrated as a cohesive system.
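As a minimal illustration of one of the techniques listed above, the sketch below shows symmetric per-tensor int8 quantization, storing weights in 8 bits plus a single float scale for roughly a 4x memory reduction over float32. This is a simplified example under stated assumptions, not the specific scheme discussed in the keynote; all function names here are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    # Hypothetical helper: map float32 weights to int8 with one shared scale.
    # The scale maps the largest-magnitude weight to the int8 extreme 127.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 2.4], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)

# Rounding error per weight is bounded by half the quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

In practice, production systems refine this idea with per-channel or per-group scales and activation-aware calibration, but the memory/accuracy trade governed by the scale factor is the same.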