Talk

A Systematic Benchmarking Methodology for Efficient LLM Inference Evaluation

Abstract

Organizations deploying LLM inference face critical decisions about hardware procurement, software stack selection, and deployment configuration. Today, these decisions are frequently made through ad hoc testing, which consumes significant GPU resources and often yields suboptimal outcomes. The diversity of deployment environments, model architectures, inference frameworks, and accelerator hardware makes exhaustive benchmarking impractical.

FMwork is a systematic benchmarking methodology that addresses this challenge by narrowing both the input configuration space and the output metrics space to focus on the most informative parameters and indicators. This targeted approach accelerates evaluation, reduces resource waste, and enables consistent, reproducible comparisons across platforms.
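
To make the idea concrete, here is a minimal sketch of how narrowing the input configuration space shrinks a benchmarking sweep. The parameter names, values, and reduction choices below are illustrative assumptions, not FMwork's actual interface:

    # Illustrative only: parameter names, values, and the specific
    # narrowing strategy are assumptions, not FMwork's real configuration.
    from itertools import product

    # A naive exhaustive sweep crosses every axis of the deployment space.
    full_grid = {
        "batch_size": [1, 4, 16, 64, 256],
        "input_len": [128, 512, 2048, 8192],
        "output_len": [128, 512, 2048],
        "tensor_parallel": [1, 2, 4, 8],
    }
    exhaustive = list(product(*full_grid.values()))  # 5*4*3*4 = 240 runs

    # A narrowed sweep varies only the most informative axes and pins the
    # rest to representative values, cutting the run count by more than an
    # order of magnitude while preserving the key performance trends.
    narrowed_grid = {
        "batch_size": [1, 64, 256],
        "input_len": [512, 8192],
        "output_len": [512],
        "tensor_parallel": [4],
    }
    narrowed = list(product(*narrowed_grid.values()))  # 3*2*1*1 = 6 runs

    print(f"exhaustive runs: {len(exhaustive)}, narrowed runs: {len(narrowed)}")

In this toy setup the narrowed sweep needs 6 runs instead of 240, illustrating (under the assumed grids) the kind of reduction the methodology targets.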

In a representative study, FMwork achieved over an order-of-magnitude reduction in total benchmarking time compared to a naïve exhaustive sweep, while capturing the key trends needed for deployment decisions across NVIDIA, AMD, and Intel GPUs. By providing an open, extensible framework, FMwork benefits the broader HPC and AI community through more efficient, sustainable performance evaluation.