Recent advances in large language models (LLMs) have revolutionized natural language processing, enabling machines to produce human-like text and engage in meaningful dialogue. As these systems grow in complexity and demand substantial computational and memory resources, there is an urgent need for faster, more efficient, and more accessible LLM inference. Because of their size, LLMs now exceed the memory capacity of a single GPU, requiring intricate, expert-level configurations to run in parallel across devices. Moreover, the cost of memory accesses exceeds that of computation, creating a barrier to efficient scaling. In-memory computing (IMC) technologies offer a promising way to alleviate the growing data-movement burden of LLMs by performing computations directly within memory. By tightly merging memory and compute, IMC mitigates the von Neumann bottleneck, reducing data transfers and improving energy efficiency.
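As a rough illustration of why memory traffic dominates, consider the per-token matrix-vector product in a single Transformer projection layer. The minimal sketch below assumes a hypothetical 4096-wide hidden dimension, 16-bit weights, and batch size 1; the exact figures are illustrative, not measurements.

```python
# Back-of-the-envelope arithmetic intensity of one decode-step GEMV
# (illustrative numbers; assumes a hypothetical 4096x4096 projection
# with fp16 weights and batch size 1).

d_model = 4096                     # hidden dimension (assumed)
bytes_per_weight = 2               # fp16

weight_bytes = d_model * d_model * bytes_per_weight  # bytes streamed per token
flops = 2 * d_model * d_model                        # multiply-accumulate operations

arithmetic_intensity = flops / weight_bytes          # ~1 FLOP per byte
print(f"Weights moved per token: {weight_bytes / 1e6:.1f} MB")
print(f"FLOPs per token:         {flops / 1e6:.1f} MFLOP")
print(f"Arithmetic intensity:    {arithmetic_intensity:.2f} FLOP/byte")

# A modern GPU can sustain hundreds of FLOPs per byte of HBM bandwidth,
# so at ~1 FLOP/byte the GEMV is firmly memory-bound: the weights must be
# streamed from DRAM for every generated token.
```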
In this tutorial, we will first present the Transformer architecture at the core of modern LLMs and outline the challenges it poses for hardware acceleration. This will be followed by a brief overview of conventional approaches to running LLMs on general-purpose and special-purpose digital accelerators. We will then cover the DRAM processing-in-memory (PIM) accelerators that have recently emerged to improve throughput and energy efficiency over GPU baselines. Finally, we will review IMC architectures based on volatile and non-volatile memristive devices as well as Flash memory, and discuss how their large on-chip weight capacity can address the memory access bottleneck. Emphasis will be placed on the device-level and system-level requirements that LLM architectures impose on IMC acceleration, as well as on the challenges of maintaining high inference accuracy with this approach.
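To make the in-memory computing idea concrete, the following minimal sketch simulates a matrix-vector product carried out directly on stored conductances of an idealized memristive crossbar, with hypothetical quantization and noise parameters; it is a simplified illustration of both the operating principle and of why device non-idealities bear on inference accuracy, not a model of any specific hardware.

```python
import numpy as np

# Sketch of an analog in-memory matrix-vector multiply: weights are
# programmed as device conductances on a crossbar, the input vector is
# applied as voltages, and column currents give the dot products.
# Quantization and programming noise crudely model accuracy loss.
# (All parameters below are illustrative assumptions.)

rng = np.random.default_rng(0)

def crossbar_mvm(weights, x, levels=256, noise_std=0.01):
    """Approximate y = weights @ x on an idealized memristive crossbar."""
    w_max = np.abs(weights).max()
    # Quantize weights to the available conductance levels (per-tensor scaling).
    scale = w_max / (levels // 2 - 1)
    g = np.round(weights / scale) * scale
    # Multiplicative programming/read noise on each conductance.
    g = g * (1.0 + noise_std * rng.standard_normal(g.shape))
    # Kirchhoff's-law column summation: the accumulation happens "for free" in analog.
    return g @ x

w = rng.standard_normal((512, 512)) / np.sqrt(512)
x = rng.standard_normal(512)

y_exact = w @ x
y_analog = crossbar_mvm(w, x)
rel_err = np.linalg.norm(y_analog - y_exact) / np.linalg.norm(y_exact)
print(f"Relative error of analog MVM: {rel_err:.3%}")
```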