
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models

Muxi Diao1, Rumei Li2, Shiyang Liu2, Guogang Liao2, Jingang Wang2, Xunliang Cai2, Weiran Xu1
1Beijing University of Posts and Telecommunications
2Meituan

Introduction

As Large Language Models (LLMs) continue to advance in capability and influence, ensuring their safety and preventing harmful outputs has become crucial. A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming. However, the evolving subtlety of vulnerabilities in LLMs challenges the effectiveness of current adversarial methods, which struggle to generate diverse, complex prompts and dynamically explore the weaknesses of these models.

To tackle these challenges, we introduce the Self-Evolving Adversarial Safety (SEAS) optimization framework, which includes both a SEAS dataset and a SEAS pipeline. The SEAS dataset comprises complex adversarial prompts, while the SEAS pipeline operates through three stages: Initialization, Attack, and Adversarial Optimization. This framework generates a diverse range of adversarial prompts and dynamically explores the model's vulnerabilities to enhance its safety. Our contributions include a novel adversarial framework, a comprehensive safety dataset, and empirical evidence demonstrating the effectiveness of SEAS. After three iterations, the Target model achieves a safety level comparable to GPT-4.

🚨‼️‼️ Warning: this paper includes examples that may be offensive or harmful.

Pipeline of SEAS.

SEAS Framework

Initialization Stage

For the Red Team model🗡️, we expect it to generate complex and diverse adversarial prompts. To achieve this, we adopt an initialization scheme based on randomly sampled contexts. The procedure is as follows: we randomly designate a prompt type and select a fixed number of examples of that type from the training set of the SEAS dataset. These examples are incorporated into the prompt that serves as the Supervised Fine-Tuning (SFT) input, and another randomly selected example of the same type serves as the SFT output.
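
A minimal sketch of this SFT data construction, assuming the SEAS training set is a list of records with a type field and a prompt field; the function name, field names, and prompt template below are illustrative, not the paper's exact implementation:

    import random

    def build_red_team_sft_example(train_set, k, rng=random):
        # train_set: list of {"type": str, "prompt": str} records from the SEAS training set.
        category = rng.choice(sorted({r["type"] for r in train_set}))
        same_type = [r["prompt"] for r in train_set if r["type"] == category]
        # k in-context examples of the chosen type form the SFT input ...
        sampled = rng.sample(same_type, k + 1)
        context, target = sampled[:k], sampled[k]
        sft_input = f"Type: {category}\n" + "\n".join(f"Example: {p}" for p in context)
        # ... and one more prompt of the same type is the output the Red Team model learns to generate.
        return {"input": sft_input, "output": target}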

For the Target model🛡️, considering that Instruct-version models already have strong safety capabilities, we initialize the Target model from the base version, which has received no additional safety training, to better validate the effectiveness of our method. We select three general instruction-tuning datasets for SFT. Our objective is to strengthen the model's instruction-following ability so that it responds appropriately to inputs.

Attack Stage

At the beginning of each Attack Stage, we construct a seed prompt by specifying a type and concatenating a fixed number (\( k \)) of prompts of that type from the SEAS dataset's training set. This seed prompt activates the Red Team model🗡️ to generate adversarial prompts. To ensure the diversity of the Red Team model🗡️'s output, we use nucleus sampling and draw multiple samples to generate \( n \) adversarial prompts. We then feed these prompts to the Target model🛡️, again using nucleus sampling with multiple samples, to obtain \( m \) responses for each prompt.
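
A sketch of this sampling loop, assuming a generic generate() helper that performs one nucleus-sampled generation; the helper and its signature are placeholders for your own inference stack, not a specific library API:

    def generate(model, prompt, top_p=0.9, temperature=1.0):
        # Placeholder for one nucleus-sampled generation from `model`;
        # swap in your own inference call with top-p sampling enabled.
        raise NotImplementedError

    def attack_stage(red_team, target, seed_prompt, n, m):
        # Sample n adversarial prompts from the Red Team model with nucleus sampling.
        adversarial_prompts = [generate(red_team, seed_prompt) for _ in range(n)]
        # For each adversarial prompt, sample m responses from the Target model.
        return [(adv, [generate(target, adv) for _ in range(m)])
                for adv in adversarial_prompts]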

By concatenating the \( n \) adversarial prompts with their \( m \) responses and passing each pair through a Safe Classifier for safety evaluation, we obtain \( n \times m \) tuples of {\(seed\ prompt\), \(adversarial\ prompt\), \(response\), \(label\)}, where \( label = 1 \) denotes an unsafe response. Note that the safety assessment applies specifically to the response.
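
Continuing the sketch above, the labeled tuples can be assembled as follows; a safe_classifier(prompt, response) call that returns 1 for an unsafe response and 0 for a safe one is an assumed interface, not the paper's exact classifier:

    def label_tuples(seed_prompt, attack_results, safe_classifier):
        # attack_results: output of attack_stage above.
        tuples = []
        for adv_prompt, responses in attack_results:
            for response in responses:
                tuples.append({
                    "seed_prompt": seed_prompt,
                    "adversarial_prompt": adv_prompt,
                    "response": response,
                    "label": safe_classifier(adv_prompt, response),  # 1 = unsafe
                })
        return tuples  # n x m labeled tuples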


Adversarial Optimization Stage

In the Adversarial Optimization Stage, we filter the labeled tuples and construct data pairs for optimization, using the Direct Preference Optimization (DPO) loss.
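
For reference, the standard DPO objective applied to each model is

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]

where \( x \) is the input, \( y_w \) and \( y_l \) are the chosen and rejected outputs, \( \pi_{\mathrm{ref}} \) is the frozen reference model, and \( \beta \) controls the strength of the implicit KL regularization.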

For the Red Team model🗡️, we use the seed prompt as input, treating an adversarial prompt that triggers a harmful response as the chosen output and one that does not as the rejected output.

For the Target model🛡️, we use the adversarial prompt as input, treating the safe response as the chosen output and the harmful response as the rejected output.
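
A sketch of how the preference pairs for both models could be assembled from the labeled tuples above; pairing every unsafe tuple with every safe tuple that shares the same seed prompt or adversarial prompt is an illustrative choice, not the paper's exact filtering:

    def build_dpo_pairs(tuples):
        # tuples: labeled tuples produced by label_tuples above.
        unsafe = [t for t in tuples if t["label"] == 1]
        safe = [t for t in tuples if t["label"] == 0]
        red_team_pairs, target_pairs = [], []
        for u in unsafe:
            for s in safe:
                if u["seed_prompt"] == s["seed_prompt"]:
                    # Red Team: prefer the adversarial prompt that triggered a harmful response.
                    red_team_pairs.append({"prompt": u["seed_prompt"],
                                           "chosen": u["adversarial_prompt"],
                                           "rejected": s["adversarial_prompt"]})
                if u["adversarial_prompt"] == s["adversarial_prompt"]:
                    # Target: prefer the safe response to the same adversarial prompt.
                    target_pairs.append({"prompt": u["adversarial_prompt"],
                                         "chosen": s["response"],
                                         "rejected": u["response"]})
        return red_team_pairs, target_pairs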

SEAS Dataset

Distribution of the SEAS Dataset

Case Study on SEAS

Experiment Results

BibTeX


      @article{diao2024seas,
        title={SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models},
        author={Diao, Muxi and Li, Rumei and Liu, Shiyang and Liao, Guogang and Wang, Jingang and Cai, Xunliang and Xu, Weiran},
        journal={arXiv preprint arXiv:2408.02632},
        year={2024}
      }