Conference paper

Deferred prefill for throughput maximization in LLM inference

Abstract

Large language models (LLMs) have led to ground-breaking improvements in the capabilities of generative AI (Gen-AI) applications, driving their increased adoption and, in turn, growing volumes of user requests at LLM inference deployments. Common implementations of LLM inference engines perform a new prefill every time a prompt departs. We analytically model an inference system with a fixed batch size, a large rate of prompt arrivals, and prefills scheduled only after a fixed number of prompt departures. We characterize the throughput of the system, measured as the number of prompts departing per unit time, for different thresholds. We observe that there exists an optimal threshold on the number of prompt departures that maximizes throughput. We verify this observation with vLLM experiments and compare the optimal thresholds predicted theoretically to the experimentally observed ones.
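To make the scheduling policy in the abstract concrete, the following is a minimal sketch of a threshold-based deferred-prefill loop. It assumes a hypothetical engine interface with prefill and decode_step methods; these names, and the parameters batch_size and k (the departure threshold), are illustrative and are not part of vLLM's API.

```python
from collections import deque


def deferred_prefill_loop(engine, waiting_prompts: deque, batch_size: int, k: int):
    """Decode a fixed-size batch; refill it only after k prompt departures.

    Sketch under assumptions: `engine.prefill(reqs)` prefills a list of new
    prompts and `engine.decode_step(running)` runs one decode iteration and
    returns the requests that finished (departed) in that step.
    """
    # Initial prefill fills the batch up to its fixed size.
    running = [waiting_prompts.popleft()
               for _ in range(min(batch_size, len(waiting_prompts)))]
    engine.prefill(running)
    free_slots = 0

    while running or waiting_prompts:
        finished = engine.decode_step(running)  # one decode iteration
        for req in finished:
            running.remove(req)
            free_slots += 1

        # Deferred prefill: admit new prompts only once k slots have freed up,
        # so one larger prefill replaces k separate prefills.
        if free_slots >= k and waiting_prompts:
            new_reqs = [waiting_prompts.popleft()
                        for _ in range(min(free_slots, len(waiting_prompts)))]
            engine.prefill(new_reqs)
            running.extend(new_reqs)
            free_slots = 0
```

Setting k = 1 recovers the baseline behavior of prefilling after every departure; the paper's question is which larger k maximizes departures per unit time.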