Memory consumption is a key bottleneck in deploying large Mixture-of-Experts (MoE) transformer models, particularly on edge and resource-constrained devices. While MoE architectures improve compute efficiency through sparse activation of experts, the total parameter count across experts substantially increases the memory footprint. This work introduces MoSE (Mixture of Shared Experts), an exploratory study on reducing memory usage in MoE models through structured weight sharing among experts. Instead of maintaining fully independent expert parameters, MoSE emulates weight sharing by pairing experts according to a similarity metric and replacing each pair's parameters with their element-wise average, without modifying the underlying framework.
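The following is a minimal sketch of the pairing-and-averaging idea described above, assuming a PyTorch MoE layer whose experts are exposed as an nn.ModuleList with identical shapes. The similarity metric (cosine similarity of flattened expert weights) and the greedy matching strategy are illustrative assumptions, not necessarily the exact procedure used in MoSE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def flatten_expert(expert: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of one expert into a single vector."""
    return torch.cat([p.detach().flatten() for p in expert.parameters()])


def pair_and_average_experts(experts: nn.ModuleList) -> list[tuple[int, int]]:
    """Greedily pair the most similar experts and overwrite each pair's
    parameters with their element-wise average, emulating weight sharing
    without changing the model's structure."""
    vectors = torch.stack([flatten_expert(e) for e in experts])
    # Pairwise cosine similarity between all experts (E x E matrix).
    sim = F.cosine_similarity(vectors.unsqueeze(1), vectors.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-float("inf"))  # exclude self-pairs

    unpaired = set(range(len(experts)))
    pairs = []
    while len(unpaired) > 1:
        # Pick the remaining pair with the highest similarity.
        idx = sorted(unpaired)
        sub = sim[idx][:, idx]
        a, b = divmod(int(sub.argmax()), len(idx))
        i, j = idx[a], idx[b]
        pairs.append((i, j))
        unpaired -= {i, j}

        # Replace both experts' parameters with their average, so the two
        # experts now hold identical (effectively shared) weights.
        with torch.no_grad():
            for p_i, p_j in zip(experts[i].parameters(), experts[j].parameters()):
                avg = 0.5 * (p_i + p_j)
                p_i.copy_(avg)
                p_j.copy_(avg)
    return pairs
```

In practice, such a routine could be applied layer by layer to any MoE block that stores its experts in an nn.ModuleList; because the averaged weights overwrite the existing parameters in place, the model's code path and checkpoint format remain unchanged.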