Technical note
5 minute read

Accelerating AI inference with IBM Storage Scale

When you think of processing the inputs and outputs of an AI task, GPUs are probably the main infrastructure components that come to mind. But they are just one piece of the puzzle. Without the right network and storage infrastructure, today’s AI applications would be slow and prohibitively expensive, even with the most sophisticated GPUs. This has been true for AI training for many years, and today it is just as true for AI inference.

Large language model (LLM)-based inference is a dominant workload for modern AI applications, and it has become so taxing on all three infrastructure resources (compute, network, and storage) that entirely new software capabilities are emerging to manage and optimize those resources for inference. Notable examples include llm-d, vLLM, and Dynamo. Without efficient management and reuse of pre-computed inference artifacts, you would be overloading your GPUs with redundant re-computation while waiting a long time for your LLM to respond to each query.

Why is LLM inference so resource intensive?

Most modern LLMs are based on the transformer architecture with a self-attention mechanism. During inference, this architecture generates large quantities of intermediate runtime data in the form of key (K) and value (V) tensors. For every token in the input sequence, the model computes K and V tensors at each attention layer and keeps them in GPU memory as a cache (the KV cache), as long as there is space. The model then uses these tensors to generate the next output token. To the extent that K and V values can be reused, they do not need to be computed again, saving significant time and accelerating token generation. As new inputs (or requests) come in, however, GPU memory fills up and old values must be discarded to make room for new ones. The ability to reuse previously computed KV data is therefore limited by how much of it can be kept handy for the LLM inference service.
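
For intuition, here is a minimal, framework-agnostic sketch of how a decoder reuses cached K and V tensors instead of recomputing them at every step. This is illustrative only; production engines such as vLLM manage paged, multi-head, multi-layer caches rather than the single-head toy cache shown here.

```python
import torch

# Minimal single-head attention step with a KV cache (illustrative sketch only).
def attend_with_cache(x_new, w_q, w_k, w_v, kv_cache):
    """x_new: [1, d_model] hidden state of the newest token."""
    q = x_new @ w_q                          # query for the new token only
    k_new, v_new = x_new @ w_k, x_new @ w_v
    # Append this token's K/V to the cache so later steps can reuse them.
    kv_cache["k"] = torch.cat([kv_cache["k"], k_new], dim=0)
    kv_cache["v"] = torch.cat([kv_cache["v"], v_new], dim=0)
    # Attend over ALL cached keys/values -- nothing already cached is recomputed.
    scores = (q @ kv_cache["k"].T) / kv_cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ kv_cache["v"]

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for step in range(4):                        # decode 4 tokens
    out = attend_with_cache(torch.randn(1, d), w_q, w_k, w_v, cache)
print(cache["k"].shape)                      # torch.Size([4, 64]) -- the cache grows with every token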

To give an intuition for how unwieldy this data can get, our experiments show that the KV cache for 128K input tokens (a commonly supported context window size) for the popular Llama3-70B model is about 40 GB, and the time-to-first-token (TTFT) is about 19 seconds when using four NVIDIA H100 GPUs. If you limit yourself to storing this data just in GPU memory, you very quickly run out of space. Addressing this problem was part of the motivation for new projects like llm-d and Dynamo, which, among other things, initially focused on adding a capacity tier beyond GPU memory using CPU RAM, combined with KV-cache-aware routing. Of course, even CPU RAM quickly fills up at the data rates and sizes mentioned above. Moreover, when it comes to global optimization across a fleet of inference servers, the ability to persist data and share KV values across inference instances becomes increasingly valuable. That is why we believe the right solution involves a high-performance storage tier (high bandwidth and low latency, enhanced with cache and locality awareness), which can offer virtually infinite capacity for KV cache reuse while simultaneously delivering reduced time-to-first-token, high availability, and persistence of this important data type for AI inference.
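
As a sanity check, the ~40 GB figure is consistent with back-of-the-envelope KV cache math using Llama3-70B's published configuration (80 layers, grouped-query attention with 8 KV heads, head dimension 128, 16-bit values); the exact footprint in practice depends on the serving engine, paging, and precision.

```python
# Back-of-the-envelope KV cache size for Llama3-70B at a 128K-token context.
# Model constants are the publicly documented Llama3-70B configuration;
# actual usage varies with the serving engine and quantization.
layers       = 80          # transformer layers
kv_heads     = 8           # grouped-query attention: 8 KV heads
head_dim     = 128         # dimension per head
bytes_per_el = 2           # fp16/bf16
tokens       = 128 * 1024

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el   # K and V
total_gib = bytes_per_token * tokens / 1024**3
print(f"{bytes_per_token / 1024:.0f} KB per token, {total_gib:.1f} GiB at 128K tokens")
# -> 320 KB per token, 40.0 GiB at 128K tokens
```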

AI inference acceleration, powered by IBM Storage Scale

IBM Storage Scale’s design makes it ideal for providing persistent space (from a few terabytes to hundreds of petabytes) for KV data, storing it for a virtually unlimited number of active users and agent sessions. Adding a Scale tier to an AI inference solution enables KV cache sharing between hundreds or even thousands of GPU servers: a KV cache generated on one GPU can be immediately used by any other GPU in the cluster, reducing redundant computation overall. IBM Storage Scale also delivers predictable low latency and high throughput, which many inference use cases require to deliver an interactive user experience. Finally, enterprise capabilities like tiering, quota management, failure handling, scaling, and access control are readily available for this new type of data. Many of these capabilities were developed to meet the stringent performance requirements of typical HPC deployments. Thanks to software integration with frameworks like vLLM and llm-d, users can reap these benefits out of the box, achieving state-of-the-art AI inference cost and performance at scale, in production.

Figure 1 shows a reference inference deployment with IBM Storage Scale and llm-d. Internally, llm-d leverages the vLLM inference engine, which runs independently on every GPU server to execute inference calculations. To store KV tensors as files in a file system, llm-d/vLLM uses cache management components like LMCache that have built-in file system connectors. More specifically, LMCache requires only a basic file system interface and, optionally, distributed file sharing across the cluster, both of which IBM Storage Scale supports out of the box. So, to leverage IBM Storage Scale, an administrator simply needs to mount a Storage Scale file system on the GPU servers and point a file system connector to it, as sketched below.
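
The snippet below is a minimal sketch of that offload pattern only; it is not LMCache's actual API, and the mount path, file naming, and helper functions are hypothetical illustrations of how KV blocks can be persisted to and fetched from a shared Storage Scale mount (consult the LMCache documentation for the real connector configuration).

```python
import torch
from pathlib import Path

# Illustrative sketch of the KV offload pattern -- not LMCache's actual API.
# The mount point is hypothetical; in a real deployment, the Storage Scale
# file system is mounted on every GPU server and the cache connector is
# configured to use it.
KV_STORE = Path("/gpfs/scale-fs1/kv-cache")      # hypothetical Scale mount
KV_STORE.mkdir(parents=True, exist_ok=True)

def offload_kv_block(prefix_hash: str, k: torch.Tensor, v: torch.Tensor) -> None:
    """Persist a KV block, keyed by a hash of its token prefix."""
    torch.save({"k": k, "v": v}, KV_STORE / f"{prefix_hash}.pt")

def fetch_kv_block(prefix_hash: str):
    """Return a previously computed block (possibly written by another GPU
    server sharing the same file system), or None on a cache miss."""
    path = KV_STORE / f"{prefix_hash}.pt"
    return torch.load(path) if path.exists() else None
```

Because every inference server mounts the same namespace, a block written by one vLLM instance is immediately visible to the rest of the fleet, which is what enables the cross-server reuse described above.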

Figure 1: Architecture of a storage-accelerated AI inference cluster

Results

To demonstrate the power of IBM Storage Scale as an AI inference accelerator, we ran an experiment using Llama3-70B, served by vLLM on four H100 GPUs, with KV cache offload to either CPU DRAM or IBM Storage Scale. Results are depicted in Figure 2. The experiment confirmed some intuitions: for example, the fastest way to store and reuse KV data (aside from keeping it in the limited amount of GPU memory) is to cache it in the CPU DRAM inside the same node. For a 128K-token input context, the speedup of caching to DRAM (versus re-computing) is 23.6x.

Here is where things get exciting: when you add IBM Storage Scale as a KV cache tier, you get all of the benefits described above, including distributed sharing across the entire fleet of vLLM inference instances, along with an 8-12x speedup in time-to-first-token relative to recomputing, delivering absolute performance approaching that of CPU RAM caching (1.6 s vs. 0.8 s). The exact speedup depends on the recency (or “hotness”) of the KV data in the file system. Adding this storage tier means you can meet stringent latency expectations at scale (a TTFT of 2 seconds or less, instead of 19) and use the GPUs for higher-value operations, like token generation, instead of KV cache prefill. Perhaps even more importantly, this solution integrates natively with familiar inference services (such as llm-d in this case), meaning we deliver these gains while minimizing software complexity for the service operator.
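
For readers who want to reproduce a TTFT comparison of their own, one simple approach is to stream a completion from any OpenAI-compatible vLLM endpoint and time the arrival of the first token. The endpoint URL, model name, and prompt below are placeholders for your own deployment, not values from our experiment.

```python
import time
from openai import OpenAI

# Minimal TTFT probe against an OpenAI-compatible vLLM endpoint.
# URL, model name, and prompt are placeholders for your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "meta-llama/Meta-Llama-3-70B-Instruct") -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        stream=True,
    )
    for chunk in stream:  # the first chunk with content carries the first token
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

long_prompt = "<paste a long prompt here, e.g. a document approaching the context limit>"
print(f"TTFT: {measure_ttft(long_prompt):.2f} s")
```

Running the same long prompt twice (a cold run, then a warm run against the populated KV cache tier) and comparing the two measurements is a simple way to observe the reuse effect reported above.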

Figure 2: Impact of IBM Storage Scale capacity tier on Time-To-First-Token (TTFT) for Llama3-70B, as a function of input prompt size

What’s next

AI inference with KV cache offload introduces a new set of requirements for storage systems: both high bandwidth and low-latency access, a combination that is hard to achieve in most storage solutions. IBM Storage Scale’s unique combination of performance (300 GB/s and 13 million IOPS with sub-microsecond latency per building block), scalability (100K+ nodes), versatility (acceleration of AI training and inference, HPC, analytics, and databases), and enterprise readiness offers a value proposition that is hard to match. We are just beginning to scratch the surface of how high-performance storage can accelerate distributed inference. We see opportunities to deliver even more value by combining this work with solutions like content-aware storage, shortening time to insight for enterprise data to seconds rather than hours or days. Stay tuned, as there’s lots more to come!