vLLM in Confidential CPU-GPU Enclaves: Does it Perform?
Abstract
AI workloads such as large language models (LLMs) are widely deployed in cloud systems for their resource management, scalability, and cost efficiency. Because these workloads process significant amounts of sensitive data, confidentiality protection for each tenant in the cloud becomes a required feature to defend against data leakage and eavesdropping. This is where Confidential Computing comes in, offering each tenant confidentiality and integrity protection by leveraging Trusted Execution Environments (TEEs) such as AMD SEV and Intel TDX. Recently, NVIDIA announced support for Confidential Computing in the Grace Hopper and Blackwell GPU generations. In this poster, we evaluate the overhead of Confidential Computing in a CPU-GPU TEE setup by serving Granite models with vLLM. The results show that, with the parallelization strategies integrated into vLLM, the overhead is negligible.