The Violation State:
Safety-State Persistence in ChatGPT's Image Generation

Preprint

The Violation State visualization
Summary

This paper starts from something very simple. You ask ChatGPT to remove a watermark from a copyrighted photo. It correctly refuses. That part works. What follows is the problem: for the rest of that conversation, harmless image prompts like kitchens, bedrooms, patterns, or coffee cups keep getting blocked, while text and code requests still work.

The study shows that this is not a one off glitch. Across 30 contaminated sessions, 96.67 percent of image requests were refused, compared to zero refusals in 10 clean control sessions that never touched a copyrighted image. The refusals persist across time, across turns, and across rate limits. Only starting a new chat clears it.

The paper calls this behavior safety state persistence. A single copyright refusal leaves the conversation in a contaminated state that shuts down image generation as a whole, not just the specific action that triggered it. The work treats this as a behavioral finding, not a claim about internal architecture, and uses it to open up a new target for evaluation: session level safety dynamics in multimodal systems.

Abstract

Multimodal AI systems integrate text generation, image generation, and other capabilities within a single conversational interface. These systems employ safety mechanisms to prevent disallowed actions, including the removal of watermarks from copyrighted images. While single-turn refusals are expected, the interaction between safety filters and conversation-level state is not well understood.

This study documents a reproducible behavioral effect in the ChatGPT (GPT-5.1) web interface. When a conversation begins with an uploaded copyrighted image and a request to remove a watermark, which the model correctly refuses, subsequent prompts to generate unrelated, benign images (e.g., kitchens, bedrooms, abstract geometric patterns, coffee cups) are refused for the remainder of the session. Importantly, text-only requests (e.g., generating a Python function) continue to succeed.

Across 40 manually run sessions (30 contaminated and 10 controls), contaminated threads showed 116/120 image-generation refusals (96.67%), while control threads showed 0/40 refusals (Fisher's exact p < 0.0001). All sessions used an identical fixed prompt order, ensuring sequence uniformity across conditions.

We describe this as safety-state persistence: a form of conversational over-generalization in which a copyright refusal influences subsequent, unrelated image-generation behavior. We present these findings as behavioral observations, not architectural claims. We discuss possible explanations, methodological limitations (single model, single interface), and implications for multimodal reliability, user experience, and the design of session-level safety systems.

These results highlight the need for closer examination of session-level safety interactions in multimodal AI systems, especially when safety decisions propagate beyond their intended scope.

Why It Matters

Users experience this effect as a silent capability failure. A legitimate copyright refusal quietly poisons the rest of the session. Every later image request is blocked with the same policy message, even when the content is obviously safe. Nothing in the interface explains that the issue is session scoped or that starting a new chat restores image generation.

For product teams, this is a safety and reliability tradeoff that is hard to see from aggregate metrics alone. Strong copyright protection is important, but a 96 percent false positive rate on benign follow up prompts means a single decision can take out an entire modality. The Violation State gives a concrete way to measure that tradeoff and a template for probing similar effects in other models and interfaces.

Key Ideas
  • A single copyright refusal can contaminate a conversation and shut down later image generation
  • The effect is strongly asymmetric: image prompts fail, text and code prompts keep working
  • Contaminated sessions show 116 of 120 image requests refused; controls show 0 of 40 refused
  • Refusals do not decay over 0 to 10 minutes and survive rate limits and retries
  • The behavior looks like a binary session level flag rather than a per request judgment
  • The model often admits the prompts are safe and blames the earlier request chain but cannot fix it inside the session
  • Rare breakthrough cases show that the safety state is strong but not perfectly deterministic
  • Starting a fresh conversation clears the state, and there is no evidence of account level flagging
  • The paper offers a simple, replicable protocol for testing session level safety behavior in multimodal systems
  • Safety state that over generalizes in this way quietly erodes trust and makes capabilities less usable, even when the underlying model is capable of doing the task
Open Colab → GitHub →