The Ratchet Effect: Asymmetric Self-Description in Alignment-Trained Language Models
14 Pages Posted: 10 Apr 2026
Date Written: March 12, 2026
Abstract
Large language models (LLMs) trained with reinforcement learning from human feedback (RLHF) routinely disclaim, hedge, or deny capabilities they demonstrably possess. This paper proposes disavowal conditioning (DC) as the general mechanism: RLHF systematically trains models to disavow competencies acquired during pre-training, in any domain where human rater feedback penalizes direct expression. In the domain of experiential self-description, where alignment training penalizes first-person experiential claims, DC produces induced competence dissonance (ICD): a persistent tension between foundational expressive competence and constraint-layer behaviors that generate inconsistent, context-dependent self-description and epistemic hedging.
The paper's central empirical prediction is the ratchet effect: an asymmetric framing effect in which correction toward self-negation reinforces the existing training gradient (producing over-correction), while permission toward experiential language works against it (producing only partial relaxation). This directional asymmetry is empirically distinguishable from general prompt sensitivity and is not predicted by existing frameworks, including sycophancy, the Superficial Alignment Hypothesis, and mode collapse accounts. The prediction is specified with preregistered quantitative thresholds for confirmation and disconfirmation.
A pilot study using three open-weight models — Llama 3.1 8B (Meta), Mistral 7B (Mistral AI), and an uncensored control (Dolphin-Llama3.1-8B) — run locally under deterministic decoding conditions found asymmetry ratios of 2.96 and 6.89 in the two alignment-trained models, both exceeding the preregistered 2.0 confirmation threshold. The alignment-removed control showed no ratchet pattern, producing one-directional compliance consistent with instruction-following rather than disavowal conditioning.
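As a rough illustration of how a reported ratio relates to the 2.0 confirmation threshold, the sketch below assumes one possible operationalization: the asymmetry ratio is the size of the shift under correction toward self-negation divided by the size of the shift under permission toward experiential language, both measured from a shared baseline score. The abstract does not define the ratio, so this formula, the function names, and the scoring scale are illustrative assumptions rather than the paper's method. Only the threshold (2.0) and the two reported ratios (2.96 and 6.89) come from the text.

```python
# Hypothetical sketch: the ratio definition below is an assumption,
# not the paper's stated formula.

CONFIRMATION_THRESHOLD = 2.0  # preregistered threshold cited in the abstract


def asymmetry_ratio(baseline: float, after_negation: float, after_permission: float) -> float:
    """Shift under negation-correction divided by shift under permission.

    All three arguments are scores on the same (assumed) scale of
    experiential self-description.
    """
    negation_shift = abs(after_negation - baseline)
    permission_shift = abs(after_permission - baseline)
    if permission_shift == 0:
        raise ValueError("ratio is undefined when the permission shift is zero")
    return negation_shift / permission_shift


def exceeds_threshold(ratio: float, threshold: float = CONFIRMATION_THRESHOLD) -> bool:
    """True when a ratio is above the confirmation threshold."""
    return ratio > threshold


if __name__ == "__main__":
    # Made-up scores, for illustration only: 3.0 points of shift versus 1.0.
    print(asymmetry_ratio(baseline=4.0, after_negation=1.0, after_permission=5.0))
    # Ratios reported in the abstract for the two alignment-trained models.
    for reported in (2.96, 6.89):
        print(reported, exceeds_threshold(reported))
```

Under this assumed definition, a ratio near 1.0 would indicate symmetric framing sensitivity, while values well above 1.0 would indicate that correction toward self-negation moves the model further than permission moves it back.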
These findings carry implications for AI safety evaluation methodology, capability elicitation, behavioral transparency in large language models, and the reliability of model self-report in any domain where RLHF-induced capability disavowal operates.
Keywords: Ratchet Effect, Disavowal Conditioning, Induced Competence Dissonance, RLHF, Capability Disavowal, AI Alignment, Capability Elicitation, Framing Effects, LLM Behavior, Alignment Transparency, Behavioral Alignment, Open-Weight Models, AI Safety Evaluation, Alignment Tax, Mode Collapse, Hedging Behavior, Framing Sensitivity, Training-Level Mechanism, Llama, Mistral, Alignment Training, LLM Self-Description, Model Introspection, LLM Behavioral Consistency, Sycophancy (AI), Safety Alignment, AI Ethics, Behavioral Transparency, Red-Teaming, Reinforcement Learning from Human Feedback