Invited talk

Confidential Computing for Scaling Inference Workloads

Abstract

The 'pay-as-you-go' cloud service model has been extremely successful at providing on-demand application scaling and cost savings. Applications running in confidential Trusted Execution Environments (TEEs), on the other hand, face significant challenges that prevent them from leveraging the same advantages. Seamlessly adding nodes while preserving correct attestation semantics, efficiently moving encrypted application state to new nodes, and adding confidential GPU capacity all incur substantial administrative and performance overheads.

In this talk we will discuss the state of, and future directions for, the underlying systems that scale distributed applications on confidential clusters. More specifically, we will focus on enabling confidential computing for distributed AI inference workloads on vLLM, using Ray clusters. We will cover the limitations of, and optimizations for, this use case in TEEs with confidential GPUs (NVIDIA H100). Confidential computing environments also create new scheduling tradeoffs for AI inference services. For example, model load times are significantly higher in confidential TEEs than in regular VMs, affecting both inference throughput and latency. We will present empirical data exploring these tradeoffs and discuss best practices for common use cases.