Matías Mazzanti, Esteban Mocskos, et al.
ISCA 2025
As models scale beyond trillions of parameters, extending their functionality is increasingly achieved by fine-tuning existing base models rather than training new ones from scratch. However, fine-tuning all parameters remains computationally expensive. Recent techniques such as Low-Rank Adaptation (LoRA) have been developed to reduce the number of trainable parameters. LoRA adapters have gained widespread adoption, but their effects on GPU system metrics, such as throughput and energy efficiency, are not yet well understood. In this study, we examine these system-level metrics as a function of LoRA adapter rank. Our findings show that reducing the rank of LoRA adapters does not lead to a significant drop in model quality, while simultaneously improving throughput, energy efficiency, and memory usage. Furthermore, we find that the presence of a LoRA adapter, rather than its rank, can greatly improve model quality compared to zero-shot inference with the base model. This makes smaller LoRA adapters a compelling choice for a variety of applications.
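To illustrate how the adapter rank studied in this paper controls the number of trainable parameters, the following is a minimal PyTorch sketch of a LoRA-style layer. It is not the authors' implementation; the class name `LoRALinear` and the dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where A is (r x in)
    and B is (out x r). Only A and B are trained, so the number of
    trainable parameters scales linearly with the rank r.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Trainable-parameter count grows with r (illustrative 4096x4096 layer):
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 vs. ~16.8M for full fine-tuning
```

Lowering `r` shrinks both matrices A and B, which is the mechanism behind the paper's observation that smaller ranks reduce memory usage and improve throughput and energy efficiency.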
Ilias Iliadis
International Journal On Advances In Networks And Services
Juan Miguel De Haro, Rubén Cano, et al.
IPDPS 2022
Oleg Kolosov, Gala Yadgar, et al.
ICDCS 2023