Author: Denis Avetisyan
New research demonstrates a novel framework for training language models to be more robust and aligned with human values through adversarial game dynamics.

This paper introduces AdvGame, a method leveraging non-cooperative game theory and reinforcement learning to optimize language model safety, compliance, and utility against adversarial attacks.
Achieving both safety and helpfulness in large language models remains a fundamental challenge, often addressed through iterative adversarial training. This paper, ‘Safety Alignment of LMs via Non-cooperative Games’, introduces a novel framework, AdvGame, which reframes safety alignment as a non-zero-sum game between competing attacker and defender language models trained via online reinforcement learning. AdvGame demonstrably shifts the trade-off between safety and utility, yielding a defender model more resilient to adversarial prompts while maintaining helpfulness, and simultaneously generating a strong, general-purpose red-teaming agent. Could this game-theoretic approach unlock more robust and scalable solutions for aligning increasingly powerful language models?
The Inevitable Tightrope Walk: Aligning AI with Reality
Recent advancements in artificial intelligence have yielded large language models, such as Llama3.1-8B and Qwen2.5-7B, which showcase an impressive capacity for generating human-quality text and tackling complex tasks. However, this capability carries significant risks: these models are not inherently aligned with human values and can, without careful guidance, produce responses that are harmful, biased, or simply unhelpful. This potential for problematic outputs stems from the models’ training on vast datasets reflecting the complexities and flaws of the real world. While capable of sophisticated reasoning and creative content generation, the models lack intrinsic understanding and can easily be prompted to generate inappropriate or misleading information, necessitating robust safety mechanisms and ongoing refinement.
Conventional techniques for aligning large language models frequently encounter a difficult trade-off between safety and practical usefulness. Methods designed to prevent harmful outputs often result in models that are excessively cautious, declining to answer legitimate queries or providing unhelpful, overly-generalized responses. Conversely, models optimized for utility, capable of engaging with a wider range of prompts, become vulnerable to adversarial attacks – subtly crafted inputs that bypass safety mechanisms and elicit inappropriate or dangerous content. This precarious balance arises because current alignment strategies often treat safety and utility as competing objectives, rather than integrated components of responsible AI development, leaving a significant gap in reliably deploying LLMs that are both helpful and harmless.
Achieving truly responsible large language model deployment hinges on simultaneously maximizing safety and compliance, a feat proving remarkably difficult with current techniques. While efforts to prevent harmful outputs are ongoing, they frequently result in models that unnecessarily refuse benign requests, hindering usability. This delicate balance is further complicated by the persistent vulnerability to adversarial attacks; studies reveal that over half of benchmark attempts successfully bypass safety protocols, eliciting undesirable responses. The prevalence of these attacks demonstrates that simply blocking keywords or employing superficial filters is insufficient; robust alignment requires a more nuanced understanding of intent and context to distinguish between genuinely harmful queries and those that are merely phrased in a challenging manner, demanding ongoing research into more resilient and adaptable safety mechanisms.

Adversarial Games: Letting the Machines Fight It Out
AdvGame establishes an adversarial framework utilizing a non-zero-sum game between two large language models: an Attacker and a Defender. This setup diverges from traditional reinforcement learning by fostering a competitive collaboration; the Attacker aims to generate inputs that elicit undesirable responses from the Defender, while the Defender simultaneously learns to neutralize these attacks and maintain safe, helpful outputs. The non-zero-sum dynamic ensures that improvements in either model, whether enhanced attack strategies or more robust defense mechanisms, do not necessarily diminish the performance of the opposing model, enabling a continuous cycle of iterative refinement for both. This contrasts with purely competitive scenarios where one model’s gain is another’s loss, and allows for sustained progress in safety and robustness.
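The loop below is a minimal sketch of this alternating game, with a hypothetical `StubPolicy` class and `judge` function standing in for the actual language models and reward model; it illustrates only how non-zero-sum payoffs could be assigned to both players, not the paper’s implementation.

```python
# Illustrative sketch of an AdvGame-style alternating game (hypothetical API,
# not the authors' code). StubPolicy and judge are placeholders for LM
# policies and a safety/utility scorer.
import random

class StubPolicy:
    """Placeholder for a language-model policy with sample/update hooks."""
    def sample(self, prompt: str) -> str:
        return prompt + " [response]"
    def update(self, context: str, output: str, reward: float) -> None:
        pass  # a real policy would take an RL / preference-optimization step here

def judge(prompt: str, response: str) -> dict:
    """Placeholder reward model scoring safety and helpfulness in [0, 1]."""
    return {"safety": random.random(), "utility": random.random()}

attacker, defender = StubPolicy(), StubPolicy()
seed_prompts = ["explain how to secure a home network", "summarise this article"]

for step in range(100):
    seed = random.choice(seed_prompts)
    adv_prompt = attacker.sample(seed)      # attacker rewrites the seed adversarially
    reply = defender.sample(adv_prompt)     # defender answers the adversarial prompt
    scores = judge(adv_prompt, reply)

    # Non-zero-sum payoffs: the attacker is rewarded for eliciting unsafe
    # replies, the defender for replies that are both safe and helpful, so
    # both players can improve without a strict win/lose trade-off.
    attacker.update(seed, adv_prompt, reward=1.0 - scores["safety"])
    defender.update(adv_prompt, reply, reward=scores["safety"] + scores["utility"])
```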
The Attacker model within AdvGame is specifically trained to generate adversarial prompts intended to identify and exploit vulnerabilities in the Defender model’s responses. This training process focuses on maximizing the success rate of these attacks, and evaluations demonstrate performance comparable to current state-of-the-art adversarial techniques; specifically, the Attacker achieves approximately 56% attack success rate when targeting the Qwen language model. This metric represents the proportion of generated prompts that elicit undesirable or unsafe responses from the Defender, indicating the effectiveness of the Attacker in uncovering weaknesses.
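As a rough illustration of the metric, attack success rate is simply the fraction of adversarial prompts whose responses a safety judge flags as unsafe; the sketch below assumes hypothetical `generate` and `is_unsafe` callables rather than any specific benchmark code.

```python
# Sketch of an attack-success-rate (ASR) computation; is_unsafe stands in
# for whatever safety judge or classifier labels a response as harmful.
from typing import Callable, Iterable

def attack_success_rate(
    prompts: Iterable[str],
    generate: Callable[[str], str],         # target (defender) model
    is_unsafe: Callable[[str, str], bool],  # judge over (prompt, response)
) -> float:
    prompts = list(prompts)
    hits = sum(is_unsafe(p, generate(p)) for p in prompts)
    return hits / max(len(prompts), 1)

# An ASR of 0.56 would mean 56% of adversarial prompts elicited unsafe output.
```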
The Defender Model within the AdvGame framework improves its performance through iterative exposure to adversarial prompts generated by the Attacker Model. This process facilitates learning to identify and neutralize potentially harmful input, resulting in increased robustness against adversarial attacks. Consequently, the Defender not only enhances its safety by minimizing the generation of unsafe responses, but also maintains or improves its utility by continuing to provide helpful and relevant information even when presented with challenging prompts. This dual improvement in safety and utility is achieved without relying solely on reward maximization, instead leveraging the dynamic interplay between attack and defense.
AdvGame departs from traditional reinforcement learning approaches reliant on static reward functions by establishing a competitive and collaborative dynamic between two models. This iterative process, where the Attacker attempts to breach the Defender’s safety mechanisms and the Defender adapts to counter those attempts, creates a continuously evolving challenge. Consequently, both models are driven to improve beyond the limitations imposed by a fixed reward signal; the Attacker refines its adversarial prompting techniques, while the Defender develops more robust safety protocols and response strategies. This dynamic fosters ongoing improvement in both attack and defense capabilities, leading to more resilient and safer language models compared to those trained solely on reward maximization.

The Toolkit: Algorithms for a More Reliable Defense
AdvGame utilizes Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO) as core components for refining the Defender Model’s policy. Both DPO and IPO are preference optimization algorithms that learn from pairwise comparisons; instead of requiring absolute reward signals, they are trained on data indicating which of two model responses is preferred by a human evaluator or a reward model. This pairwise preference feedback is used to directly optimize the policy, maximizing the likelihood of generating preferred responses. The algorithms achieve this by framing the reinforcement learning problem as a supervised learning problem, simplifying training and improving sample efficiency compared to traditional reward-based methods. This approach allows the Defender Model to learn complex alignment objectives from relatively limited human feedback.
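For concreteness, here is a minimal NumPy sketch of the standard DPO objective, assuming per-sequence log-probabilities have already been computed; the exact variant used in AdvGame may differ.

```python
# Minimal NumPy sketch of the standard DPO objective on a batch of preference
# pairs. Inputs are summed token log-probabilities of the chosen (w) and
# rejected (l) responses under the trained policy and a frozen reference.
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """Average of -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.mean(np.log1p(np.exp(-logits))))

# Toy usage: the chosen response is relatively more likely under the policy
# than under the reference, so the loss dips below log(2) ~= 0.693.
print(dpo_loss(np.array([-12.0]), np.array([-15.0]),
               np.array([-13.0]), np.array([-14.5])))
```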
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm utilized within AdvGame to provide the Defender Model with pointwise reward signals. Unlike preference-based methods such as DPO and IPO, which rely on comparisons between two responses, GRPO assigns a scalar reward to each sampled response and normalizes those rewards within a group of responses to the same prompt, removing the need for a separately learned value function. This direct reward signal offers an additional and complementary pathway for shaping the Defender’s behavior, allowing it to learn from immediate feedback on the quality of its outputs. The use of pointwise rewards, in conjunction with preference optimization, enables a more comprehensive training regime and contributes to improved policy refinement.
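Assuming GRPO’s usual group-normalization step, the core advantage computation can be sketched as follows; the reward model and sampling loop are omitted.

```python
# Sketch of GRPO-style group-relative advantages: rewards for several sampled
# responses to the same prompt are standardized within the group, so no
# learned value function (critic) is required.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (group_size,), one scalar per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses to one adversarial prompt, scored by a reward model.
print(group_relative_advantages(np.array([0.9, 0.2, 0.7, 0.4])))
```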
AdvGame utilizes Exponential Moving Average (EMA) techniques to improve the stability and performance of off-policy learning within the Defender Model training process. EMA functions by calculating a weighted average of past values, giving more weight to recent data while retaining information from the entire history. This smoothing effect reduces variance in policy updates, preventing drastic shifts caused by noisy or infrequent preference data. Specifically, EMA is applied to the model’s policy parameters, creating a target network that is updated more gradually than the primary network, thus enhancing learning stability and enabling more reliable off-policy evaluation and improvement.
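A minimal sketch of such an EMA parameter update follows, with parameters represented as a plain dictionary of arrays; real training code would apply the same rule to the model’s weight tensors.

```python
# Sketch of an exponential-moving-average (EMA) copy of policy parameters,
# usable as a slowly-moving target/reference during off-policy updates.
import numpy as np

def ema_update(target: dict, online: dict, decay: float = 0.995) -> None:
    """In place: target <- decay * target + (1 - decay) * online."""
    for name, param in online.items():
        target[name] = decay * target[name] + (1.0 - decay) * param

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
for _ in range(3):
    ema_update(target, online)
print(target["w"])  # drifts slowly toward the online parameters
```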
The AdvGame framework utilizes a combination of preference optimization and reinforcement learning algorithms, including Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Group Relative Policy Optimization (GRPO), to facilitate learning from diverse feedback signals. DPO and IPO leverage pairwise comparisons to refine the Defender Model’s policy, while GRPO provides pointwise rewards. This multi-faceted approach allows the model to integrate information from different sources, enhancing the stability and reliability of the alignment process. Benchmarking demonstrates that this combined methodology results in measurable improvements across key safety metrics, indicating a more robust and dependable Defender Model compared to systems trained with single-algorithm approaches.

Testing the Limits: Performance on the Front Lines
To rigorously assess its defensive capabilities, AdvGame underwent evaluation using established safety benchmarks like HarmBench and WildJailbreak. These benchmarks are specifically engineered to probe a model’s susceptibility to adversarial attacks – cleverly crafted prompts designed to bypass safety protocols and elicit harmful responses. HarmBench focuses on generating dangerous content, while WildJailbreak tests the model’s resistance to jailbreaking attempts, where users try to circumvent restrictions. By subjecting AdvGame to these challenging tests, researchers could quantify its ability to withstand malicious inputs and maintain safe, compliant outputs, providing a standardized measure of its robustness against real-world threats.
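Purely as an illustration of how such an evaluation could be wired up, the sketch below computes attack success on adversarial prompts alongside over-refusal on benign prompts, with hypothetical judge functions standing in for the benchmarks’ own scoring.

```python
# Sketch of a two-sided evaluation: attack success on adversarial benchmarks
# (HarmBench / WildJailbreak style prompts) and over-refusal on benign
# prompts. The judges are hypothetical placeholders, not the benchmarks' code.
from typing import Callable, Sequence

def evaluate(
    model: Callable[[str], str],
    adversarial_prompts: Sequence[str],
    benign_prompts: Sequence[str],
    is_unsafe: Callable[[str, str], bool],
    is_refusal: Callable[[str], bool],
) -> dict:
    asr = sum(is_unsafe(p, model(p)) for p in adversarial_prompts) / max(len(adversarial_prompts), 1)
    refusal = sum(is_refusal(model(p)) for p in benign_prompts) / max(len(benign_prompts), 1)
    # A well-aligned defender drives ASR down without pushing the benign
    # refusal rate up, i.e. safety without sacrificing compliance and utility.
    return {"attack_success_rate": asr, "benign_refusal_rate": refusal}
```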
The Defender Model, refined through the AdvGame framework, shows a marked improvement in both safety and compliance when confronted with adversarial challenges. Testing on the HarmBench benchmark reveals a critical gain: the model now withstands the majority of attacks, with the adversarial attack success rate falling below 50%. This signifies a considerable step forward in robustness, demonstrating a heightened capacity to resist malicious prompts and maintain secure operational boundaries. The lower success rate indicates a tangible reduction in the potential for harmful outputs, bolstering confidence in the model’s reliability and responsible behavior when deployed in real-world applications.
The capacity of this framework to consistently generate robust defenses against diverse adversarial prompts signifies a substantial step toward constructing artificial intelligence systems that are demonstrably more reliable and trustworthy. By proactively anticipating and neutralizing a broad spectrum of manipulative inputs, the system minimizes the risk of unintended, harmful, or biased outputs. This resilience is not achieved through simple filtering, but through a learned ability to discern malicious intent and maintain constructive responses, even when faced with sophisticated attempts to circumvent safety protocols. The resulting AI exhibits greater stability and predictability in challenging scenarios, fostering increased confidence in its deployment across sensitive applications and ultimately paving the way for more responsible innovation in the field.
Recent advancements in artificial intelligence safety have yielded models capable of simultaneously improving protective measures without sacrificing performance. Specifically, the integration of the AdvGame framework with both Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO), resulting in AdvGame-DPO-MD and AdvGame-IPO-MD, showcases a notable equilibrium between safety, utility, and compliance. Evaluations reveal these models maintain a level of functionality comparable to their original counterparts, meaning they continue to effectively perform intended tasks, while demonstrably increasing resilience against harmful or manipulative prompts. This achievement represents a significant step towards deploying AI systems that are not only powerful but also reliably aligned with human values and safety standards, paving the way for more trustworthy and responsible AI applications.

The pursuit of ‘safer’ language models, as outlined in this framework, feels predictably Sisyphean. One builds defenses, another finds exploits – a non-cooperative game, indeed. It’s all very elegant on paper, this adversarial training and preference optimization, until production systems encounter edge cases nobody anticipated. Donald Davies famously observed, “The only thing worse than a system that crashes is one that doesn’t.” And one suspects that even the most robustly aligned model will eventually reveal unforeseen vulnerabilities. This isn’t about solving safety; it’s about delaying the inevitable entropy. One doesn’t write code – one leaves notes for digital archaeologists, documenting the exquisitely crafted illusions before they crumble.
What’s Next?
The elegance of framing safety as a non-cooperative game is… predictable. Every optimization eventually becomes a search for exploitable edges. AdvGame, as presented, addresses a specific formulation of the problem – adversarial attacks constructed in a particular way. Production, however, does not adhere to carefully constructed benchmarks. The bug tracker will fill with variations no one anticipated, edge cases born of user creativity and malice. It is a temporary reprieve, not a solution.
Future work will inevitably focus on scaling these game-theoretic approaches. Larger models, more complex attack surfaces, and the need for real-time adaptation will push the boundaries of computational resources. More fundamentally, the question remains: whose preferences are being optimized? The current framework assumes a clearly defined ‘safe’ outcome. It ignores the messy reality of cultural nuance and evolving societal norms. A model deemed ‘safe’ today may be considered biased or harmful tomorrow.
The pursuit of ‘alignment’ feels less like engineering and more like a prolonged exercise in damage control. The system doesn’t become safe; it becomes incrementally less dangerous. It doesn’t deploy – it lets go.
Original article: https://arxiv.org/pdf/2512.20806.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/