Shawn Tan, Songlin Yang, et al.
ICLR 2025
Fine-tuning on task-specific data to boost downstream performance is a crucial step in leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning a model on a few adversarial samples, or even on benign data, can greatly compromise its pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on bilevel optimization that up-ranks safe, high-quality fine-tuning data and down-ranks unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with win-rate increases of 8.5% and 9.7% over random selection on LLAMA-3-8B-INSTRUCT and MERLINITE-7B, respectively. Our code is available at https://github.com/hanshen95/SEAL.
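The bilevel data-ranking idea in the abstract can be illustrated with a short PyTorch sketch. This is not the authors' implementation (that lives at the linked repository): it substitutes a toy linear model for the LLM, and the names `ranker`, `inner_lr`, and the synthetic safe set are illustrative assumptions. The ranker assigns each fine-tuning example a weight; an unrolled inner gradient step trains the model on the weighted loss, and the outer step updates the ranker so the resulting model performs well on a trusted safe set.

```python
# Minimal sketch of bilevel data ranking for safe fine-tuning.
# Illustrative only, not the SEAL implementation; the toy linear
# model, `ranker`, and `inner_lr` are assumptions for demonstration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: fine-tuning examples and a small trusted safe set.
X_train, y_train = torch.randn(32, 16), torch.randn(32, 1)
X_safe, y_safe = torch.randn(8, 16), torch.randn(8, 1)

w = torch.zeros(16, 1, requires_grad=True)   # inner: model parameters
ranker = nn.Linear(16, 1)                    # outer: data ranker
outer_opt = torch.optim.Adam(ranker.parameters(), lr=1e-2)
inner_lr = 0.1

for step in range(100):
    # Ranker assigns each training example a selection weight in (0, 1).
    weights = torch.sigmoid(ranker(X_train)).squeeze(-1)

    # Inner problem: one unrolled gradient step on the weighted loss,
    # keeping the graph so the outer step can differentiate through it.
    per_example = ((X_train @ w - y_train) ** 2).squeeze(-1)
    inner_loss = (weights * per_example).mean()
    (grad_w,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w_updated = w - inner_lr * grad_w

    # Outer problem: the updated model should do well on the safe set;
    # this gradient flows back into the ranker, up-ranking helpful data.
    outer_loss = ((X_safe @ w_updated - y_safe) ** 2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()

    # Commit the inner update (detached) before the next round.
    with torch.no_grad():
        w.copy_(w_updated)
        w.grad = None
```

Differentiating through the unrolled inner step is what lets the safe-set loss tell the ranker which examples to up- or down-rank; at LLM scale, methods of this kind would replace the explicit unrolling with an approximation to this hypergradient.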
Xinran Wang, Qi Le, et al.
ICLR 2025
Minkyong Kim, Zhen Liu, et al.
INFOCOM 2008
Daniel M. Bikel, Vittorio Castelli
ACL 2008