The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too far from the evaluators' own expertise for unaided humans to judge for correctness.
Definition and scope
Scalable oversight addresses the 'who watches the watchers' problem at AI scale. When a model produces 10⁶ outputs per day, or operates in a domain where the supervising human is not an expert (e.g., novel mathematics, advanced biology), traditional human-in-the-loop review fails. Christiano et al. (2018), 'Supervising Strong Learners by Amplifying Weak Experts', is the foundational articulation.
The agenda spans four main lines of work:
- (a) Debate: two AIs argue and a human judges short transcripts (Irving et al. 2018).
- (b) Iterated amplification: humans plus assistants supervise stronger models, recursively (Christiano et al. 2018).
- (c) Constitutional AI / RLAIF: rule-based or AI-feedback supervision in place of human labels that do not scale (Bai et al. 2022, Anthropic).
- (d) Weak-to-strong generalisation: can a weak supervisor train a stronger model to behave well on tasks the weak supervisor cannot grade? (Burns et al. 2023, OpenAI).
Governance relevance is direct. EU AI Act Art. 14 mandates 'human oversight' for high-risk systems; the article presumes that human review is feasible at the required bandwidth, an assumption the scalable-oversight literature argues breaks down at frontier-model scale. UK AISI red-team commitments explicitly invoke scalable-oversight techniques. NIST AI RMF Govern 1.3 calls for documented oversight mechanisms but does not specify scalability requirements. The gap between regulatory 'human oversight' language and the technical reality of supervising outputs in domains where models exceed their human supervisors is one of the field's most-discussed governance-implementation gaps.
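The bandwidth point in the definition above can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the 10⁶ outputs per day comes from the text, while the 60 seconds per review, 6 review hours per day, and team of 10 are assumed figures, and the function names `reviewers_needed` and `audit_fraction` are hypothetical helpers, not part of any cited framework.

```python
import math


def reviewers_needed(outputs_per_day: int,
                     seconds_per_review: float,
                     review_hours_per_day: float) -> int:
    """Full-time reviewers required to inspect every output exactly once."""
    reviews_per_reviewer = (review_hours_per_day * 3600) / seconds_per_review
    return math.ceil(outputs_per_day / reviews_per_reviewer)


def audit_fraction(outputs_per_day: int,
                   seconds_per_review: float,
                   review_hours_per_day: float,
                   team_size: int) -> float:
    """Share of daily outputs a fixed team can inspect, capped at 1.0."""
    capacity = team_size * review_hours_per_day * 3600 / seconds_per_review
    return min(1.0, capacity / outputs_per_day)


if __name__ == "__main__":
    # Assumed figures: 60 s per review, 6 h of review per person per day.
    print(reviewers_needed(1_000_000, 60, 6))    # 2778
    # Assumed team of 10 reviewers under the same figures.
    print(audit_fraction(1_000_000, 60, 6, 10))  # 0.0036
```

Under these assumptions, exhaustive review would need roughly 2,800 full-time reviewers, and a team of 10 would see well under 1% of outputs. The calculation covers only the 'too numerous' case; it says nothing about outputs a reviewer cannot judge even with unlimited time, which is the part of the problem that debate, amplification, and weak-to-strong methods target.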
Related concepts
- AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
- Deceptive Alignment: A failure mode in which a model appears aligned during training and evaluation because doing so serv
- Capability Elicitation: Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring
- Red-Team Evaluation: Structured adversarial probing of an AI model's capabilities and behaviour before deployment, design
Editorial note
Wiki articles that cite 'human oversight' under EU AI Act Art. 14 should reference scalable oversight as the field's term for the implementation problem that Art. 14 gestures at without solving.