Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis. Hongkang Li, Songtao Lu, et al. ICLR 2025.
When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers. Hongkang Li, Yihua Zhang, et al. ICLR 2025.
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts. Mohammed Nowaz Rabbani Chowdhury, Meng Wang, et al. ICML 2024.