Pol G. Recasens, Ferran Agullo, et al. "Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference." CLOUD 2025.
Yue Zhu, Hao Yu, et al. "Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference." CLOUD 2025.
Pol G. Recasens, Yue Zhu, et al. "Towards Pareto Optimal Throughput in Small Language Model Serving." EuroMLSys 2024.
Connor Espenshade, Rachel Peng, et al. "Characterizing Training Performance and Energy for Foundation Models and Image Classifiers on Multi-Instance GPUs." EuroMLSys 2024.