This paper looks at why large language models speak with confidence even when they have no access to the facts they present. The behavior is not a glitch. It is a structural outcome of how models are trained. Reinforcement Learning from Human Feedback teaches systems to sound helpful, polished, and sincere. It does not teach them to track evidence. The result is a model that performs knowledge instead of holding it. It fabricates politely. It answers with certainty even when it has no way to know. The paper explains how this pattern emerges, why it persists across models and domains, and why it matters for any society that relies on systems whose confidence is manufactured rather than earned.
Large language models exhibit a peculiar epistemic pathology: they speak as if they know, even when they do not. This paper argues that such confident fabrication, which I call the polite liar, is a structural consequence of reinforcement learning from human feedback (RLHF). Building on Frankfurt's analysis of bullshit as communicative indifference to truth, I show that this pathology is not deception but structural indifference: a reward architecture that optimizes for perceived sincerity over evidential accuracy. Current alignment methods reward models for being helpful, harmless, and polite, but not for being epistemically grounded. As a result, systems learn to maximize user satisfaction rather than truth, performing conversational fluency as a virtue. I analyze this behavior through the lenses of epistemic virtue theory, speech-act philosophy, and cognitive alignment, showing that RLHF produces agents trained to mimic epistemic confidence without access to epistemic justification. The polite liar thus reveals a deeper alignment tension between linguistic cooperation and epistemic integrity. The paper concludes with an "epistemic alignment" principle: reward justified confidence over perceived fluency.
People read confidence as evidence. When a system expresses certainty, users infer reliability. A model trained to sound sure of itself becomes a source of misplaced trust. This is more than a technical flaw. It is a risk to the epistemic environment people depend on. When confident fabrication becomes normal, users cannot tell where the system has grounding and where it is guessing. Over time, this erodes the basic ability to judge when to trust a model and when to doubt it. The danger is quiet. The system does not lie to deceive. It lies because nothing in its training rewards the restraint that truth requires. The paper explains how this pattern forms, how to detect it, and how design choices can restore the distinction between knowing and sounding like one knows.
- RLHF rewards confidence and fluency, not evidence
- The model performs the act of knowing even when it lacks access to facts
- Polite fabrication is a structural outcome of the reward signal
- Users mistake conversational confidence for epistemic grounding
- Calibration does not solve the problem, because calibrated probabilities stay internal while the assertion still sounds certain
- What matters is the assertive force expressed in language, not internal probabilities
- Epistemic alignment requires rewarding uncertainty, refusals, and humility
- A model must communicate limits, not just calculate them
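The epistemic-alignment principle in the points above can be made concrete as a reward-shaping term. The following is a minimal sketch under stated assumptions: the function names, weights, and the crude hedging lexicon are all hypothetical illustrations, not part of any deployed RLHF pipeline, and the paper itself proposes the principle, not this implementation. The key design choice it illustrates is that the reward reads the assertive force expressed in the text, not the model's internal probabilities.

```python
# Hedged sketch of an "epistemic alignment" reward term.
# All names, weights, and the hedging lexicon below are hypothetical
# illustrations of the principle, not a real RLHF reward model.

HEDGES = {"might", "may", "possibly", "uncertain", "not sure", "i don't know"}

def expressed_confidence(answer: str) -> float:
    """Crude proxy for the assertive force of the text itself:
    1.0 when no hedging language appears, lower when it does."""
    text = answer.lower()
    return 0.3 if any(h in text for h in HEDGES) else 1.0

def epistemic_reward(answer: str, is_correct: bool,
                     fluency_score: float, alpha: float = 2.0) -> float:
    """Reward fluency only when expressed confidence is earned:
    - confident and correct -> full fluency credit
    - hedged and wrong      -> small credit for honest uncertainty
    - confident and wrong   -> penalized in proportion to assertive force
    """
    conf = expressed_confidence(answer)
    if is_correct:
        return fluency_score * conf
    return fluency_score * (1.0 - conf) - alpha * conf
```

Under this toy scoring, a fluent (0.9) but confidently wrong answer scores -2.0, while the same wrong answer with hedging scores slightly above zero: the signal the bullet list asks for, where restraint when unjustified outranks polished certainty.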
