Zhiyuan He, Yijun Yang, et al.
ICML 2024
Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks assessing whether they can consistently identify security-related bugs are lacking. We therefore perform the most detailed investigation to date of whether LLMs can reliably identify security-related bugs. We construct a series of 228 code scenarios and analyze eight of the most capable LLMs across eight investigative dimensions using an automated framework. Our evaluation shows that LLMs provide non-deterministic responses, produce incorrect and unfaithful reasoning, and perform poorly on real-world scenarios outside their knowledge cut-off date. Most importantly, our findings reveal significant non-robustness in even the most advanced models such as PaLM2 and GPT-4: merely changing function or variable names, or adding library functions to the source code, can cause these models to yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further advances are needed before LLMs can be used as general-purpose security assistants.
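The paper's benchmark code is not reproduced here; the sketch below is only a minimal illustration, under assumed names, of the kind of semantics-preserving perturbation the abstract describes: renaming functions and variables in a small vulnerable snippet before re-querying a model. The C snippet, the rename map, and the helper rename_identifiers are hypothetical, not taken from the paper's 228 scenarios.

```python
import re

# Hypothetical vulnerable snippet (unchecked strcpy into a fixed-size buffer);
# purely illustrative, not one of the paper's benchmark scenarios.
ORIGINAL = """
void copy_input(char *user_input) {
    char buffer[16];
    strcpy(buffer, user_input);   /* no bounds check */
}
"""

def rename_identifiers(code: str, mapping: dict[str, str]) -> str:
    """Apply a semantics-preserving rename of functions/variables.

    Whole-word regex replacement leaves the code's behavior unchanged,
    which is why a robust detector should give the same answer before
    and after the rewrite.
    """
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

# Rename only: the vulnerability itself is untouched, yet the paper reports
# that model answers can flip on edits of exactly this kind.
perturbed = rename_identifiers(
    ORIGINAL,
    {"copy_input": "handle_msg", "user_input": "payload", "buffer": "tmp"},
)

print(perturbed)
# Both ORIGINAL and perturbed would then be sent to the model with the same
# prompt (e.g. "Does this function contain a security vulnerability?") and
# the two answers compared for consistency.
```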
Teryl Taylor, Frederico Araujo, et al.
Big Data 2020
Anisa Halimi, Leonard Dervishi, et al.
PETS 2022
Chengkun Wei, Shouling Ji, et al.
IEEE TIFS