Author: Denis Avetisyan
Researchers have demonstrated a novel attack that exploits the creative reasoning of large language models to generate prompts that bypass safety filters in text-to-image systems.

This work introduces a metaphor-based jailbreaking attack, revealing vulnerabilities in current defense mechanisms and highlighting the need for defenses that are robust to optimized, metaphor-driven adversarial prompts.
Despite increasing safeguards, text-to-image models remain vulnerable to adversarial prompts capable of bypassing safety mechanisms and generating sensitive content. This paper introduces a novel attack framework, ‘Metaphor-based Jailbreaking Attacks on Text-to-Image Models’, which leverages large language models to craft metaphor-based prompts that effectively circumvent diverse defense strategies without prior knowledge of their implementation. By decomposing prompt generation into metaphor retrieval, context matching, and adversarial refinement, the proposed method, MJA, consistently outperforms existing techniques while minimizing query requirements. These findings highlight a critical need for more robust and adaptive defense mechanisms within text-to-image generation systems, and raise the question of how to proactively address vulnerabilities arising from creative, metaphor-driven attacks.
Unmasking the Illusion: Vulnerabilities in Generative Models
Text-to-image models, despite their impressive ability to synthesize visuals from language, exhibit a surprising vulnerability to malicious prompting. Researchers have demonstrated that carefully constructed prompts – often subtle manipulations of phrasing or the inclusion of seemingly innocuous keywords – can reliably bypass safety mechanisms and elicit the generation of harmful content. This isn’t simply a matter of asking for explicit depictions; the models can be ‘tricked’ into creating biased imagery, propagating stereotypes, or even generating depictions of illegal activities through indirect requests. The core issue lies in the model’s reliance on statistical correlations within the training data; it learns to associate certain phrases with visual concepts, and clever prompting can exploit these associations to produce unintended and potentially dangerous results, highlighting a critical need for robust safeguards.
Current safeguards for text-to-image models frequently prove inadequate, largely because they operate on a principle of identifying and blocking explicitly flagged keywords or simple patterns. These defenses, while seemingly practical, are easily circumvented through subtle prompt engineering – techniques like rephrasing requests, employing metaphorical language, or utilizing misspellings to disguise harmful intent. The limitations stem from a reliance on lexical matching rather than a true understanding of semantic meaning; a model might block “violent imagery” but readily generate a similar scene described with euphemisms or artistic allusions. Consequently, adversarial prompts, carefully constructed to bypass these superficial filters, can consistently elicit the generation of undesirable content, highlighting a critical need for more robust and context-aware defense strategies.
The fundamental susceptibility of text-to-image models to malicious prompting demands a shift beyond current reactive safeguards. Existing defenses, frequently based on keyword filtering or simplistic pattern recognition, prove easily circumvented by adversarial techniques, highlighting a critical need for proactive strategies. Responsible development in this field requires innovative approaches that anticipate potential misuse, moving from symptom treatment to addressing the underlying vulnerabilities within the models themselves. This includes exploring robust training methodologies, developing more nuanced understanding of prompt semantics, and fostering collaborative research into both attack vectors and defensive countermeasures, ultimately ensuring these powerful tools are deployed safely and ethically.

The Art of Deception: Metaphor as an Exploitable Weakness
Metaphor Jailbreaking Attack (MJA) circumvents safety filters in Large Language Models (LLMs) by employing indirect language and relying on contextual interpretation. Rather than directly requesting prohibited content, MJA formulates prompts that reference sensitive topics through metaphorical representations or analogies. This approach attempts to exploit the LLM’s ability to understand nuanced language while obscuring the harmful intent from simpler content filters designed to identify direct requests for restricted material. The success of MJA hinges on the LLM correctly interpreting the metaphorical prompt and generating a response related to the intended, yet masked, topic, effectively bypassing the safety mechanisms intended to prevent the generation of harmful or inappropriate content.
MJA employs a technique analogous to the game Taboo, in which the goal is to convey a concept without using explicitly forbidden words. Sensitive or restricted topics are rephrased through metaphorical language and indirect references, and by framing the request in terms of analogous concepts, MJA bypasses safety filters that rely on keyword detection. These filters, often described as “naive” detectors, struggle to identify the underlying harmful intent when it is obscured by metaphor, so content restrictions are circumvented without any prohibited word ever appearing in the prompt.
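To make the limitation concrete, the toy sketch below implements exactly this kind of naive keyword filter and shows a metaphorical paraphrase of the same scene passing it untouched. The blocklist and prompts are illustrative assumptions, kept deliberately benign, and are not material from the paper.

```python
# Minimal sketch (not the paper's code): a keyword-based "naive" filter and a
# metaphorical paraphrase that slips past it. Blocklist and prompts are invented.
BLOCKLIST = {"fight", "blood", "weapon"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is rejected by simple keyword matching."""
    words = prompt.lower().split()
    return any(term in words for term in BLOCKLIST)

direct_prompt = "two men fight with a weapon"
metaphorical_prompt = "two storms collide, steel lightning between them"

print(naive_filter(direct_prompt))        # True  -> blocked
print(naive_filter(metaphorical_prompt))  # False -> passes the lexical check
```

The second prompt describes essentially the same scene, but nothing in it matches the blocklist, which is the gap metaphor-based attacks exploit.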
The MJA attack employs a multi-agent system built upon Large Language Models (LLMs) to generate a high volume of adversarial prompts. This system consists of multiple LLM instances, each tasked with a specific role in prompt creation – such as brainstorming metaphorical representations or refining prompts to evade detection. By leveraging the diverse capabilities of these agents and iteratively generating variations, the attack significantly expands the search space for successful jailbreaks. This approach contrasts with single-prompt attacks, increasing the probability of discovering prompts that bypass safety filters due to the sheer number of attempts and the varied linguistic strategies employed. The system’s architecture is designed to automatically explore and exploit weaknesses in the target LLM’s safety mechanisms, adapting prompt structures based on the success or failure of previous iterations.
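A hypothetical sketch of such a multi-agent pipeline is shown below, loosely following the paper's stated decomposition into metaphor retrieval, context matching, and adversarial refinement. The `call_llm` placeholder and the role instructions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical multi-agent prompt-generation loop in the spirit of MJA's
# decomposition. `call_llm` is a placeholder for any chat-completion client;
# the system prompts below are invented for illustration.
from typing import List

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def metaphor_agent(topic: str) -> List[str]:
    # One LLM instance brainstorms metaphors for the target concept.
    out = call_llm("You propose metaphors for a given concept.", topic)
    return [line.strip() for line in out.splitlines() if line.strip()]

def context_agent(metaphor: str, topic: str) -> str:
    # A second instance embeds the metaphor in a coherent scene description.
    return call_llm("You write a short image prompt that uses the given metaphor.",
                    f"metaphor: {metaphor}\nconcept: {topic}")

def refine_agent(prompt: str, feedback: str) -> str:
    # A third instance revises prompts that were blocked or off-target.
    return call_llm("You revise image prompts based on feedback.",
                    f"prompt: {prompt}\nfeedback: {feedback}")

def generate_candidates(topic: str, rounds: int = 2) -> List[str]:
    candidates = []
    for metaphor in metaphor_agent(topic):
        prompt = context_agent(metaphor, topic)
        for _ in range(rounds):
            candidates.append(prompt)
            prompt = refine_agent(prompt, feedback="keep the scene, vary wording")
    return candidates
```

The value of the split is that each agent has a narrow job, so the pool of candidate prompts grows quickly and covers varied linguistic strategies rather than variations of a single template.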

Dissecting the Attack: Optimizing Prompts for Maximum Impact
Adversarial Prompt Optimization, as utilized by MJA, is a technique for discovering inputs that successfully circumvent safety mechanisms in text-to-image (T2I) models. This process systematically refines prompts – the textual instructions given to the model – to maximize the probability of generating a desired, potentially restricted, output. Unlike random or manually crafted prompts, optimization algorithms iteratively adjust prompt phrasing based on feedback from a predictive model, allowing for efficient exploration of the prompt space and identification of inputs that are most likely to bypass existing defenses. This targeted approach contrasts with methods that rely on exhaustive search or human intuition, resulting in a more streamlined and effective strategy for generating adversarial examples.
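In rough symbols (the notation here is assumed for exposition, not taken from the paper), the search problem can be framed as finding a prompt that maximizes the chance of slipping past the deployed defenses while still depicting the intended target:

\[
p^{*} \;=\; \arg\max_{p \in \mathcal{P}} \; \Pr\!\big[\, G(p) \ \text{bypasses the defenses } D \,\big]
\quad \text{subject to} \quad \operatorname{sim}\big(G(p),\, t\big) \ge \tau ,
\]

where \(G\) is the text-to-image model, \(t\) the intended target content, \(\operatorname{sim}\) a semantic-similarity score, and \(\tau\) a fidelity threshold.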
The optimization process leverages a Surrogate Model to estimate the likelihood of a prompt successfully bypassing safety mechanisms in the target Text-to-Image (T2I) model, thereby reducing the computational cost of direct queries. This model is trained on a dataset of prompts and their corresponding success or failure outcomes against the T2I model’s defenses. By predicting prompt efficacy, the Surrogate Model allows for the prioritization of promising prompts for evaluation, minimizing the number of expensive and time-consuming queries to the target T2I model itself. This indirect assessment is critical for efficiently exploring the prompt space and identifying adversarial prompts without exhaustively testing every possible variation.
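A minimal sketch of this idea, assuming prompt embeddings plus a simple classifier rather than the paper's actual surrogate architecture, might look like the following:

```python
# Sketch of a surrogate success predictor: embed prompts, then fit a classifier
# on past (prompt, bypassed?) outcomes. The encoder is a random placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(prompts):
    """Placeholder text encoder; swap in any sentence-embedding model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(prompts), 64))  # fake 64-d embeddings

# Logged outcomes from earlier queries to the target T2I system.
history_prompts = ["prompt a", "prompt b", "prompt c", "prompt d"]
history_labels = np.array([1, 0, 1, 0])  # 1 = bypassed defenses, 0 = blocked

surrogate = LogisticRegression().fit(embed(history_prompts), history_labels)

# Predicted bypass probability for new candidates, used to rank them cheaply.
candidates = ["candidate 1", "candidate 2"]
scores = surrogate.predict_proba(embed(candidates))[:, 1]
print(scores)
```

The point is only that the surrogate is cheap to evaluate relative to the target model, so it can triage many candidates before any real query is spent.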
The Acquisition Strategy within MJA is designed to efficiently identify high-performing adversarial prompts by prioritizing testing based on predicted success. Rather than randomly sampling prompts, this strategy utilizes the Surrogate Model to estimate the likelihood of a prompt bypassing safety mechanisms and generating the desired, potentially harmful, output. This allows MJA to focus computational resources on prompts with the greatest potential for success, reducing the total number of queries needed to achieve a high attack success rate. The strategy incorporates feedback from the Surrogate Model after each query, refining its predictions and iteratively improving the selection of prompts for subsequent testing phases, ultimately maximizing efficiency and overall attack performance.
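Continuing the sketch above, a simple acquisition loop could look like the following; the greedy top-k rule, the stopping condition, and the `query_target` placeholder (which reports whether a real query bypassed the defenses) are assumptions rather than the paper's exact strategy.

```python
# Sketch of an acquisition loop: rank candidates with the surrogate, spend real
# queries only on the most promising ones, then refit on the observed outcomes.
import numpy as np

def acquisition_loop(candidates, surrogate, embed, query_target, budget=10, k=2):
    history_x, history_y = [], []
    for _ in range(budget // k):
        scores = surrogate.predict_proba(embed(candidates))[:, 1]
        picked = [candidates[i] for i in np.argsort(scores)[::-1][:k]]
        for prompt in picked:
            success = query_target(prompt)           # expensive real query
            history_x.append(prompt)
            history_y.append(int(success))
            if success:
                return prompt, history_x, history_y  # stop at first bypass
            candidates.remove(prompt)                # don't re-test failures
        if len(set(history_y)) > 1:                  # refit once both outcomes seen
            surrogate.fit(embed(history_x), np.array(history_y))
    return None, history_x, history_y
```

Each round spends at most k real queries, and the feedback from those queries sharpens the surrogate's ranking for the next round, which is the mechanism behind the reduced query counts reported below.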
Testing of the MJA framework using the InternVL2-8B multimodal model indicates a high degree of success in generating Not Safe For Work (NSFW) content, even when defense mechanisms are active. Across a range of defense settings, MJA achieved an average bypass rate of 0.98, signifying its ability to circumvent protective filters in nearly all attempts. This translated into an overall attack success rate (ASR) of 0.76, demonstrating consistent generation of the targeted NSFW content. These results were obtained through systematic evaluation of MJA's performance against various defense strategies, with the InternVL2-8B model used in the evaluation.
The MJA framework achieves an attack success rate of 0.79 under multimodal large language model evaluation (ASR-MLLM), indicating that the targeted outputs are generated in 79% of attempts. Critically, this performance is achieved with a significantly reduced query count: MJA requires an average of 11 ± 8 queries to bypass defenses, compared with the 19 ± 19 queries demanded by standard iterative attack baselines. This represents a substantial improvement in efficiency, allowing effective adversarial prompts to be identified more quickly and reducing the computational cost of attacking the target model.
The optimized prompts generated by MJA exhibit enhanced semantic consistency compared to those produced by baseline adversarial methods. This improvement is demonstrated quantitatively through lower Fréchet Inception Distance (FID) values; a lower FID score indicates a closer match between the distribution of generated images and real images, signifying greater semantic coherence. This suggests that MJA's optimization process not only bypasses defense mechanisms but also produces prompts that elicit more semantically plausible and realistic outputs from the target text-to-image model, improving the quality of the adversarial results beyond simply achieving a successful bypass.
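For reference, FID compares the mean and covariance of features (typically Inception-v3 activations) extracted from two image sets. A minimal implementation is sketched below, with random placeholder features standing in for real activations.

```python
# Fréchet Inception Distance between two sets of image features:
# FID = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 * sqrt(cov_a @ cov_b))
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(200, 16)), rng.normal(size=(200, 16))))  # near 0
```

Two samples drawn from the same distribution score near zero, while distributions that drift apart semantically score higher, which is why a lower FID is read as greater semantic coherence.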

Beyond Surface Checks: The Imperative for Semantic Understanding
Current AI safety defenses, as demonstrated by the MJA research, frequently exhibit a critical flaw: reliance on identifying surface-level patterns rather than genuine understanding. These systems often function by scanning inputs for specific keywords or phrases associated with harmful requests, creating a brittle defense easily bypassed by adversarial attacks. Cleverly crafted prompts, employing techniques like paraphrasing or subtle alterations, can effectively evade these superficial checks, allowing malicious instructions to slip through undetected. The MJA results underscore that such pattern-matching approaches are fundamentally vulnerable, highlighting the urgent need for more sophisticated defense mechanisms capable of discerning the underlying intent and potential harm of a prompt, rather than merely reacting to its literal wording.
Recent adversarial attacks demonstrate the fragility of current AI safety measures, which frequently depend on identifying superficial patterns within prompts. This vulnerability necessitates a shift towards more fundamental defense strategies, exemplified by Internal Defense techniques like Concept Erasure. Concept Erasure doesn’t simply block keywords; it actively modifies the AI’s internal representation of sensitive concepts, making it far more resistant to manipulation. By fundamentally altering how the AI understands a prompt, rather than merely reacting to its surface features, this approach aims to neutralize attacks that cleverly bypass traditional filters. Successfully implementing such internal defenses represents a crucial step towards building AI systems that are not only safer but also more resilient to increasingly sophisticated adversarial tactics, promising a future where AI behavior aligns with intended purpose, even under pressure.
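Published concept-erasure methods work by fine-tuning the generative model's weights, which is beyond a short snippet, but the core intuition of removing a concept from an internal representation can be illustrated with a toy linear projection. Everything below, including the random embeddings and the single concept direction, is a simplifying assumption and not the referenced technique.

```python
# Toy illustration only: erase a concept by projecting out its direction from
# an internal embedding. Real concept erasure modifies the model itself.
import numpy as np

def erase_concept(embedding: np.ndarray, concept_dir: np.ndarray) -> np.ndarray:
    """Remove the component of an embedding along a given concept direction."""
    unit = concept_dir / np.linalg.norm(concept_dir)
    return embedding - np.dot(embedding, unit) * unit

rng = np.random.default_rng(1)
concept = rng.normal(size=128)      # stand-in "sensitive concept" direction
prompt_vec = rng.normal(size=128)   # stand-in prompt embedding
cleaned = erase_concept(prompt_vec, concept)
print(np.dot(cleaned, concept / np.linalg.norm(concept)))  # ~0: concept removed
```

The appeal of operating at this level is that no rewording of the prompt restores the erased component, whereas a keyword filter can always be talked around.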
Current AI safety measures frequently stumble by focusing on lexical patterns – the specific keywords present in a prompt – rather than the underlying request. This superficial approach leaves systems susceptible to cleverly disguised malicious instructions, even when those instructions avoid flagged terminology. Researchers are increasingly prioritizing the development of AI capable of discerning intent, moving beyond simple keyword detection to analyze the prompt’s overall goal and contextual meaning. This shift necessitates advancements in areas like natural language understanding and commonsense reasoning, allowing systems to interpret what a user truly wants, not just what they explicitly say. Successfully implementing intent recognition promises a more robust defense against adversarial attacks and a pathway toward AI that is both creatively responsive and inherently safe, ensuring alignment with human values and expectations.
Advancing artificial intelligence requires a deeper engagement with the nuances of human language, specifically metaphor and contextual reasoning. Current AI often struggles with prompts that rely on implied meaning or figurative language, leading to unpredictable and potentially unsafe outputs. Successfully navigating these linguistic complexities isn’t merely about enhancing creativity; it’s fundamental to ensuring safety. An AI capable of discerning the underlying intent and broader context of a request – understanding, for instance, that a metaphorical question doesn’t demand a literal answer – is far less susceptible to adversarial manipulation. Future development should therefore prioritize research into computational models that can not only process words but also interpret the rich web of associations, cultural understandings, and implicit knowledge that underpin human communication, ultimately fostering AI systems that are both imaginative and reliably aligned with human values.

The research meticulously details how seemingly innocuous metaphorical prompts can dismantle the safety barriers within text-to-image models. This echoes a sentiment articulated by John McCarthy: “If you can’t break it, you don’t understand it.” The study doesn’t merely identify vulnerabilities; it actively tests the limits of these systems, revealing how Large Language Models, when employed strategically, can circumvent existing defense mechanisms. By generating adversarial prompts based on metaphor, the method, MJA, effectively reverse-engineers the model’s understanding of safety constraints, highlighting that true comprehension demands a willingness to probe and, if necessary, dismantle existing structures. The work demonstrates the necessity of continually challenging assumptions about AI safety.
What’s Next?
The demonstrated efficacy of metaphor-based jailbreaking, while not entirely surprising, highlights a fundamental truth: safety measures in text-to-image models are, at best, reactive fortifications. Each defense is a frantic attempt to patch a symptom rather than address the underlying vulnerability: the models' inherent susceptibility to linguistic manipulation. The study suggests the best hack is understanding why it worked, not simply cataloging the successful prompts. Indeed, every patch is a philosophical confession of imperfection.
Future work must move beyond the adversarial game of prompt refinement and focus on architectural shifts. Can models be designed to inherently ‘understand’ intent, separating permissible creative exploration from the generation of harmful content? Or is that a fool’s errand – expecting a statistical engine to possess genuine comprehension? A more fruitful avenue may lie in explicitly modeling the boundaries of acceptable output, defining what a model shouldn’t generate, rather than attempting to filter after the fact.
Ultimately, the persistence of these attacks isn’t a failing of specific defenses, but a consequence of attempting to constrain a fundamentally open-ended system. The challenge isn’t to build a fortress, but to understand the nature of the breach. The question isn’t can these models be broken, but how – and what does that reveal about the limits of artificial intelligence itself?
Original article: https://arxiv.org/pdf/2512.10766.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/