Apoorve Mohan, Matthew Sheard
NVIDIA GTC 2022
Large language models (LLMs) have led to ground-breaking improvements in the capabilities of generative AI (Gen-AI) applications, driving increased adoption and, in turn, growing volumes of user requests at LLM inference deployments. Common existing implementations of LLM inference engines perform a new prefill every time there is a prompt departure. We analytically model an inference system with a fixed batch size, a large rate of prompt arrivals, and prefills scheduled after a fixed number of prompt departures. We characterize the throughput of the system as the number of prompts departing per unit time for different thresholds, and observe that there exists an optimal threshold on the number of prompt departures that maximizes throughput. We verify this observation with vLLM experiments and compare the theoretically predicted optimal threshold to the experimentally observed one.
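To make the threshold policy concrete, the Python sketch below simulates a batched decode loop in which a new prefill is scheduled only after a chosen number of prompt departures. The batch size, decode-length distribution, per-step cost, and prefill cost are illustrative assumptions, not the paper's analytical model or vLLM's actual scheduler; sweeping the threshold merely shows the qualitative trade-off between prefill overhead and running the batch underfilled.

    import random

    def simulate(threshold, batch_size=32, num_departures=5000, seed=0):
        """Simulate a batched decode loop where a new prefill is scheduled
        only after `threshold` prompt departures (hypothetical parameters)."""
        rng = random.Random(seed)
        # Remaining decode steps for each active prompt (assumed lengths).
        active = [rng.randint(32, 256) for _ in range(batch_size)]
        time = 0.0
        departed = 0
        free_slots = 0
        decode_step_cost = 1.0   # assumed cost of one decode iteration
        prefill_cost = 40.0      # assumed fixed cost of one prefill pass

        while departed < num_departures:
            # One decode step for every active prompt.
            time += decode_step_cost
            still_running = []
            for remaining in active:
                if remaining <= 1:
                    departed += 1
                    free_slots += 1
                else:
                    still_running.append(remaining - 1)
            active = still_running

            # Schedule a prefill once enough prompts have departed.
            if free_slots >= threshold:
                time += prefill_cost
                active.extend(rng.randint(32, 256) for _ in range(free_slots))
                free_slots = 0

        return departed / time  # throughput: departures per unit time

    if __name__ == "__main__":
        for k in (1, 2, 4, 8, 16, 32):
            print(f"threshold={k:2d}  throughput={simulate(k):.3f}")

With these made-up parameters, very small thresholds pay the prefill cost too often, while very large thresholds leave the batch underutilized for long stretches, so an intermediate threshold maximizes simulated throughput.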
Marcelo Amaral, Tatsuhiro Chiba, et al.
CLOUD 2022
Pranjal Gupta, Karan Bhukar, et al.
ICPE 2025
Archit Patke, Christian Pinto, et al.
ICS 2025