Scalable Oversight

scalable-oversight · safety · concept

Source: https://policywindow.org/wiki/scalable-oversight

Generated 2026-05-30T22:11:19 UTC

Summary

The set of techniques for supervising AI systems whose outputs are too complex, too numerous, or too far outside the evaluators' expertise for unaided humans to judge reliably.

At a glance

Used by: 4 instrument(s)
Related concepts: alignment, deceptive-alignment, capability-elicitation, red-team-evaluation
Primary source: Christiano, P., Shlegeris, B., Amodei, D. (2018), 'Supervising Strong Learners by Amplifying Weak Experts.'
Source URL: https://arxiv.org/abs/1810.08575

Details

Scalable oversight addresses the 'who watches the watchers' problem at AI scale. When a model produces 10⁶ outputs per day, or operates in a domain where the supervising human is not an expert (e.g., novel mathematics, advanced biology), traditional human-in-the-loop review fails: reviewers can neither keep pace with the volume nor reliably judge the content. Christiano et al. (2018), 'Supervising Strong Learners by Amplifying Weak Experts', is the foundational articulation.

The agenda spans:

(a) Debate: two AIs argue and a human judges short transcripts (Irving et al. 2018).
(b) Iterated amplification: humans working with AI assistants supervise stronger models, recursively (Christiano et al. 2018).
(c) Constitutional AI / RLAIF: rule-based or AI-generated feedback takes the place of human labels that do not scale (Bai et al. 2022, Anthropic).
(d) Weak-to-strong generalisation: can a weak supervisor train a stronger model to behave well on tasks the weak supervisor cannot grade? (Burns et al. 2023, OpenAI).

Governance relevance is direct. EU AI Act Art. 14 mandates 'human oversight' for high-risk systems; the provision is written on the assumption that human review is feasible at the required bandwidth, an assumption the scalable-oversight literature argues breaks down at frontier-model scale. UK AISI red-team commitments explicitly invoke scalable-oversight techniques. NIST AI RMF Govern 1.3 calls for documented oversight mechanisms but does not specify scalability requirements. The gap between regulatory 'human oversight' language and the technical reality of supervising outputs in domains beyond human expertise is one of the field's most-discussed governance-implementation gaps.
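
Worked example (illustrative only). The bandwidth argument above can be made concrete with back-of-envelope arithmetic. The short Python sketch below estimates how many full-time reviewers would be needed to check a model producing 10⁶ outputs per day. Only the 10⁶ figure comes from this article; the function name `reviewers_needed`, the 2 minutes per review, the 6 review hours per day, and the 1% sampling rate are assumptions chosen for illustration, not measured values.

```python
import math


def reviewers_needed(outputs_per_day: int,
                     minutes_per_review: float,
                     review_hours_per_day: float,
                     sample_fraction: float = 1.0) -> int:
    """Full-time reviewers needed to check a fraction of daily outputs.

    All parameters other than the output volume are hypothetical
    planning inputs, not figures taken from the article.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    if minutes_per_review <= 0 or review_hours_per_day <= 0:
        raise ValueError("review time and daily hours must be positive")

    # How many outputs one person can review in a working day.
    reviews_per_person = (review_hours_per_day * 60) / minutes_per_review
    # How many outputs must be reviewed at the chosen sampling rate.
    reviews_required = outputs_per_day * sample_fraction
    return math.ceil(reviews_required / reviews_per_person)


# Assumed inputs: 2 minutes per review, 6 review hours per day.
full_review = reviewers_needed(1_000_000, 2, 6)           # review everything
sampled_review = reviewers_needed(1_000_000, 2, 6, 0.01)  # review a 1% sample

print(full_review)     # 5556
print(sampled_review)  # 56
```

Under these assumed inputs, exhaustive review needs several thousand reviewers, and even a 1% sample needs dozens. The calculation covers volume only; it says nothing about whether a reviewer can judge an output that lies outside their expertise, which is the second half of the problem described above.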

How to cite this article

APA

Policy Window. (n.d.). Scalable Oversight [Wiki article — Concept]. https://policywindow.org/wiki/scalable-oversight

Chicago

Policy Window. n.d. "Scalable Oversight." Wiki article (Concept). https://policywindow.org/wiki/scalable-oversight.

Harvard

Policy Window (n.d.) 'Scalable Oversight', Wiki article — Concept, available at: https://policywindow.org/wiki/scalable-oversight.

OSCOLA

Policy Window, 'Scalable Oversight' (Wiki article — Concept, n.d.) <https://policywindow.org/wiki/scalable-oversight> accessed [date].

BibTeX

@misc{policywindow-scalable-oversight,
  title  = {Scalable Oversight},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {scalable-oversight — safety},
  url    = {https://policywindow.org/wiki/scalable-oversight},
  note   = {Primary source: https://arxiv.org/abs/1810.08575}
}