Matías Mazzanti, Esteban Mocskos, et al.
ISCA 2025
As models scale beyond trillions of parameters, extending their functionality is increasingly achieved by fine-tuning existing base models rather than training new ones from scratch. However, fine-tuning all parameters remains computationally expensive. Recent techniques such as Low-Rank Adaptation (LoRA) have been developed to reduce the number of trainable parameters. LoRA adapters have gained widespread adoption, but their effects on GPU system metrics, such as throughput and energy efficiency, are not yet well understood. In this study, we examine these system-level metrics as a function of LoRA adapter rank. Our findings show that reducing the rank of LoRA adapters does not lead to a significant drop in model quality, while simultaneously improving throughput, energy efficiency, and memory usage. Furthermore, we find that the presence of a LoRA adapter, rather than its rank, can greatly improve model quality compared to zero-shot inference with the base model. This makes smaller LoRA adapters a compelling choice for a variety of applications.
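To illustrate how the adapter rank studied in this paper controls the number of trainable parameters, the following is a minimal PyTorch sketch of a LoRA-style layer. It is not the authors' implementation; the class name `LoRALinear` and the dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where A is (r x in)
    and B is (out x r). Only A and B are trained, so the number of
    trainable parameters scales linearly with the rank r.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Trainable-parameter count grows with r (illustrative 4096x4096 layer):
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65,536 vs. ~16.8M for full fine-tuning
```

Lowering `r` shrinks both matrices A and B, which is the mechanism behind the paper's observation that smaller ranks reduce memory usage and improve throughput and energy efficiency.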
Ilias Iliadis
International Journal On Advances In Networks And Services
Juan Miguel De Haro, Rubén Cano, et al.
IPDPS 2022
Oleg Kolosov, Gala Yadgar, et al.
ICDCS 2023