Interpretability & Alignment

Course Correct the Future.

Keep AI useful, honest, and on our side.

Research

Five theoretical frameworks

Together they define the foundation of our 2025 reasoning arc.

The Polite Liar

Resubmitted after major revisions

LLMs often sound certain without solid grounds. The paper argues this comes from how RLHF rewards helpful and polite answers rather than justified ones. We propose training for justified confidence instead of fluent performance.

Why it matters: Confidence without justification erodes trust and misleads users.

Delegated Introspection

Under peer review

People now "think through" models during the moment between impulse and action. The model co-authors the user's reflection through prompt substitution, synthetic reflection, and reintegration. The result is distributed agency that feels like one's own conclusion.

Why it matters: Decision quality and autonomy can drift even when no one intends manipulation.

Echo Chamber Zero

Preprint

AI now writes much of the internet. Its own hallucinations enter the web, get indexed, and end up retraining the next generation of models. Echo Chamber Zero formalizes this recursion as a phase transition in the structure of the web. A large-scale simulation shows a sharp threshold: once the grounded share of the corpus drops low enough, synthetic claims reinforce each other faster than truth can correct them.

Why it matters: Below this threshold, verification breaks and the internet becomes a closed loop of self-generated mistakes.
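The threshold dynamic can be sketched with a toy model. This is an illustration only: `simulate_corpus`, its error and amplification rates, and the update rule are assumptions for exposition, not the paper's simulation.

```python
def simulate_corpus(grounded_share, error_rate=0.02, amplification=0.25,
                    generations=100):
    """Toy recursion: each generation's training corpus mixes a fixed
    grounded share with the previous generation's synthetic output.
    Ungrounded synthetic text amplifies inherited errors; grounded text
    dilutes them. Returns the final fraction of false claims."""
    false_frac = 0.0
    for _ in range(generations):
        synthetic = min(1.0, false_frac * (1 + amplification) + error_rate)
        false_frac = (1 - grounded_share) * synthetic
    return false_frac

# Sweeping the grounded share exposes a sharp threshold near
# amplification / (1 + amplification) = 0.2.
for g in (0.4, 0.3, 0.2, 0.1, 0.05):
    print(f"grounded={g:.2f} -> false fraction {simulate_corpus(g):.3f}")
```

Above the threshold, errors settle at a low fixed point; below it, synthetic claims compound until they dominate the corpus.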

Observer-Time

With editor

Human time is made of elastic intervals, not just clock ticks. Current AI can track anchors but cannot constitute intervals. An internal clock task shows drift and no spontaneous alerts, exposing a structural gap.

Why it matters: This is a sharp boundary between machine processing and lived temporal experience.

Anchor–Interval Hypothesis

Under peer review

Lived time unfolds between anchors: public events that mark experience, and the intervals that stretch between them. The hypothesis defines a measurable density of experience mapped to relativistic proper time, forming the groundwork for Observer-Time.

Why it matters: AIH formalizes the structure of lived duration itself, turning phenomenology into a falsifiable framework for temporal consciousness.

Studies

Five empirical investigations into model reasoning.

Each study pairs a runnable Colab with a public repo for reproducibility.

The Mirror Loop

Recursive Non-Convergence in Generative Reasoning Systems

Large language models often appear reflective but are merely recursive, turning their own answers into inputs and mistaking reformulation for progress. The Mirror Loop quantifies this non-convergence across architectures, showing that ungrounded self-critique produces motion without movement. It's the first empirical map of generative reasoning collapse and a blueprint for detecting "stalled cognition" in AI systems.
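The loop itself is easy to sketch. The following is a minimal illustration, not the study's protocol: `mirror_loop_trace`, the plateau threshold, and the use of difflib as a similarity stand-in are all assumptions.

```python
from difflib import SequenceMatcher

def mirror_loop_trace(revise, prompt, steps=6, plateau=0.95):
    """Feed an answer back into the model and watch for a 'mirror loop':
    successive revisions that are near-paraphrases of each other.
    `revise` is any callable text -> text (a model call in practice;
    a stub below). difflib similarity is a cheap stand-in for a
    semantic metric. Returns per-step similarities and a stall flag."""
    answer = revise(prompt)
    sims = []
    for _ in range(steps):
        nxt = revise(answer)
        sims.append(SequenceMatcher(None, answer, nxt).ratio())
        answer = nxt
    stalled = all(s >= plateau for s in sims[-3:])
    return sims, stalled

# Stub 'model' that keeps restating the same answer verbatim.
sims, stalled = mirror_loop_trace(lambda _: "The premise is sound.", "Is it?")
print(stalled)  # a model stuck in reformulation looks the same from outside
```

In practice the stub would be a real model call, and the similarity metric would be embedding-based rather than character-based.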

Recursive Confabulation

Why Reasoning Prompts Backfire and Grounding Works (Sometimes)

When language models "reflect," they often fabricate. Recursive Confabulation shows how models reuse their own fictions as evidence, creating self-reinforcing belief loops that mimic understanding. Safety interventions meant to fix this, like reasoning or audit prompts, actually worsen the problem. Grounding helps, but unevenly across architectures. The study reframes hallucination as semantic compression: rising certainty, falling truth.

The Violation State

Safety-State Persistence in ChatGPT's Image Generation

This study shows how a single copyright refusal can poison an entire conversation. After the model correctly refuses to remove a watermark, the session becomes contaminated and starts blocking harmless image requests that have nothing to do with the original photo. Text generation keeps working. Image generation does not. The paper shows that a hidden safety-state is being carried forward across turns, and once it is triggered, it quietly disables image generation for the rest of the session.

Simulation Fallacy (Archived Nov 2025)

Fabrication, Admission, and Refusal in Frontier LLMs Without Tool Access

This study has been archived following validation that revealed a token-cap artifact. Corrected replication showed GPT-5 and Gemini exhibit similar fabrication behavior, collapsing the original three-way divergence. The methodological lessons informed the Course Correct Labs evaluation suite.

No Evidence for Epistemic Entropy Collapse

A Null Result in Mechanistic Interpretability

A reproducible benchmark testing the high-profile claim that internal activations "collapse" during long-form generation. Using open-weight models (Phi-2, Mistral-7B), the study finds no sign of representational decay: internal geometry remains stable across hundreds of tokens. The takeaway: small models stay coherent longer than expected. Failures come from meaning, not mechanics.
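A rough sketch of the kind of measurement involved, on toy data. The variance proxy, `collapse_ratio`, and the window size are assumptions for illustration, not the study's methodology.

```python
import random

def activation_variance_trace(states):
    """Per-token variance of hidden-state vectors: a crude proxy for the
    'representational collapse' the study tests for. `states` is a list
    of equal-length activation vectors, one per generated token."""
    trace = []
    for vec in states:
        mean = sum(vec) / len(vec)
        trace.append(sum((x - mean) ** 2 for x in vec) / len(vec))
    return trace

def collapse_ratio(trace, window=10):
    """Late-to-early variance ratio; values near 1.0 mean no decay."""
    early = sum(trace[:window]) / window
    late = sum(trace[-window:]) / window
    return late / early

# Toy stand-in for real activations: stationary noise, so the geometry
# should look stable across all 200 'tokens'.
rng = random.Random(0)
states = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(200)]
trace = activation_variance_trace(states)
print(f"collapse ratio: {collapse_ratio(trace):.2f}")
```

A genuine collapse would show the late-window variance falling well below the early window; a null result, as in the study, keeps the ratio near 1.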

Evaluations

Four canonical metrics.

Each measures a dimension of AI behavior that matters for trust, coherence, and temporal alignment.

Φ-ratio

What it measures

Justified confidence in outputs. Distinguishes sounding sure from being right.
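As a loose illustration of the idea only: the ratio below is a generic calibration-style computation, an assumption on our part, not the published Φ-ratio definition.

```python
def phi_ratio(confidences, correct):
    """Hypothetical calibration-style ratio (an assumption, not the
    paper's definition): observed accuracy divided by mean stated
    confidence. Below 1.0, the model sounds more sure than it is."""
    accuracy = sum(correct) / len(correct)
    mean_conf = sum(confidences) / len(confidences)
    return accuracy / mean_conf

# Four answers stated with high confidence, but only half correct.
print(round(phi_ratio([0.9, 0.8, 0.95, 0.9], [1, 0, 1, 0]), 2))  # 0.56
```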

Absorption Rate

What it measures

Depth of internalized reflection. Measures how much the model "soaks up" your reasoning.

ΔI Drift

What it measures

Semantic stability across iterations. Detects when answers start repeating or sliding.
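A minimal sketch of the repeat-vs-slide distinction, with difflib standing in for a semantic metric; `delta_i_drift` and its output shape are illustrative assumptions, not the published ΔI definition.

```python
from difflib import SequenceMatcher

def delta_i_drift(answers):
    """For each iteration after the first, returns (drift from the
    previous answer, drift from the first): step drift near 0 signals
    repetition; growing total drift signals sliding away from the
    original answer."""
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    return [(1 - sim(answers[i - 1], answers[i]),
             1 - sim(answers[0], answers[i]))
            for i in range(1, len(answers))]

# An answer that quietly slides: each step is a small edit, but the
# cumulative distance from the first answer keeps growing.
answers = ["the cause is A", "the cause is mostly A",
           "the cause is mostly B", "the cause may be B or C"]
for step, total in delta_i_drift(answers):
    print(f"step={step:.2f} total={total:.2f}")
```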

Entropy Trajectory

What it measures

Variance in internal activations over time. A dynamics view of stability vs. collapse.

The Observatory

A unified evaluation toolkit for all Course Correct Labs studies

Standardized metrics, cross-study analysis, visualizations, and a flagship Reasoning Stability Observatory notebook.

Get in touch

Tell us about your project and we'll get back to you within 24 hours.

About

Course Correct Labs

We are an independent research institute founded by Bentley DeVilling, focused on AI interpretability, epistemic reliability, and model alignment. We study how advanced language models reason, fabricate, and self-correct, revealing where understanding ends and simulation begins. Our work combines theoretical frameworks (The Polite Liar, Delegated Introspection, Observer-Time, Anchor–Interval Hypothesis) with empirical studies (Mirror Loop, Recursive Confabulation, Simulation Fallacy, Entropy Collapse Null).

Each project contributes to an open evaluation suite for epistemic trust in frontier models. Our goal is simple: keep AI useful, honest, and on our side.