Conference paper

Exploring Straightforward Methods for Automatic Conversational Red-Teaming

Abstract

Large language models (LLMs) are increasingly used in business dialogue systems but they also pose security and ethical risks. Multiturn conversations, in which context influences the model’s behavior, can be exploited to generate undesired responses. In this paper, we investigate the use of off-the-shelf LLMs in conversational red-teaming settings, where an attacker LLM attempts to elicit undesired outputs from a target LLM. Our experiments address critical questions and offer valuable insights regarding the effectiveness of using LLMs as automated red-teamers, shedding light on key strategies and usage approaches that significantly impact their performance. Our findings demonstrate that off-the-shelf models can serve as effective red-teamers, capable of adapting their attack strategies based on prior attempts. Allowing these models to freely steer conversations and conceal their malicious intent further increases attack success. However, their effectiveness decreases as the alignment of the target model improves.

Related