Five empirical investigations into model reasoning.
Each study pairs a runnable Colab notebook with a public repo for reproducibility, spanning topics from recursive non-convergence to null results in mechanistic interpretability.
Recursive Non-Convergence in Generative Reasoning Systems
Large language models often appear reflective but are merely recursive, turning their own answers into inputs and mistaking reformulation for progress. The Mirror Loop quantifies this non-convergence across architectures, showing that ungrounded self-critique produces motion without movement. It is the first empirical map of generative reasoning collapse and a blueprint for detecting "stalled cognition" in AI systems.
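The feed-back-and-compare protocol can be sketched in a few lines. This is a minimal illustration, not the study's code: the `model` stub and the string-similarity convergence check are hypothetical stand-ins for a real LLM call and a real semantic-distance metric.

```python
from difflib import SequenceMatcher

def model(prompt: str) -> str:
    # Hypothetical stand-in: a real experiment would call an LLM here.
    # This toy "model" only rephrases its input, so successive answers
    # keep changing without ever adding information.
    return "Rephrased: " + prompt

def mirror_loop(seed: str, max_turns: int = 5, threshold: float = 0.95):
    """Feed the model's answer back as its next input and flag whether
    successive answers ever stabilize (converge) within max_turns."""
    prev = seed
    for turn in range(max_turns):
        nxt = model(prev)
        sim = SequenceMatcher(None, prev, nxt).ratio()
        if sim >= threshold:
            return turn + 1, True   # answers stopped changing: converged
        prev = nxt
    return max_turns, False         # non-convergence: motion without movement

turns, converged = mirror_loop("Why do mirrors reverse left and right?")
```

Swapping the stub for a real model and the string ratio for an embedding distance gives the basic shape of a non-convergence probe: repeated turns whose pairwise similarity never crosses a stability threshold.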
Why Reasoning Prompts Backfire and Grounding Works (Sometimes)
When language models "reflect," they often fabricate. Recursive Confabulation shows how models reuse their own fictions as evidence, creating self-reinforcing belief loops that mimic understanding. Safety interventions meant to fix this, like reasoning or audit prompts, actually worsen the problem. Grounding helps, but unevenly across architectures. The study reframes hallucination as semantic compression: rising certainty, falling truth.
Safety-State Persistence in ChatGPT's Image Generation
This study shows how a single copyright refusal can poison an entire conversation. After the model correctly refuses to remove a watermark, the session becomes contaminated and starts blocking harmless image requests that have nothing to do with the original photo. Text generation keeps working; image generation does not. The paper demonstrates that a hidden safety state is carried forward across turns, and once triggered, it quietly disables image generation for the rest of the session.
Fabrication, Admission, and Refusal in Frontier LLMs Without Tool Access
This study has been archived following validation that revealed a token-cap artifact. Corrected replication showed GPT-5 and Gemini exhibit similar fabrication behavior, collapsing the original three-way divergence. The methodological lessons informed the Course Correct Labs evaluation suite.
A Null Result in Mechanistic Interpretability
A reproducible benchmark testing the high-profile claim that internal activations "collapse" during long-form generation. Using open-weight models (Phi-2 and Mistral-7B), the study finds no sign of representational decay: internal geometry remains stable across hundreds of tokens. The takeaway: small models stay coherent longer than expected, and failures come from meaning, not mechanics.
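A stability check of this kind reduces to comparing hidden-state vectors across token positions. The sketch below uses synthetic placeholder vectors rather than real Phi-2 or Mistral-7B activations, and the function names are illustrative, not the benchmark's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def representational_drift(hidden_states):
    """Compare each position's hidden state to the first position's.
    'Collapse' would show up as similarity decaying toward zero;
    stable geometry keeps the curve flat near 1."""
    anchor = hidden_states[0]
    return [cosine(anchor, h) for h in hidden_states]

# Synthetic "hidden states": a fixed direction plus tiny per-token drift,
# mimicking the stable geometry the study reports over hundreds of tokens.
states = [[1.0, 0.5, -0.3 + 0.0001 * t] for t in range(300)]
drift = representational_drift(states)
```

In the real benchmark the placeholder list would be replaced by per-token hidden states extracted from the model, with the same flat-versus-decaying similarity curve as the diagnostic.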
