Coverage Games — replicability protocol
Quarterly events where multiple classifiers independently rate a stratified sample of coverage cells, then compare calls. The inter-rater agreement rate is the operational measurement for the "Replicable" leg of the Three Rs framework on /wiki/methodology §0. Modelled on the Institute for Replication replication-games protocol.
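As a rough illustration of the agreement rate described above, the following sketch computes simple percent agreement between two classifiers over the same cells. The type labels and the call lists are made-up placeholders, not the recorded calls from any event; the function name `percent_agreement` is likewise an assumption for illustration, not something defined by the protocol.

```python
def percent_agreement(calls_a, calls_b):
    """Share of cells on which two classifiers made the same type call."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("both classifiers must rate the same non-empty set of cells")
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return matches / len(calls_a)


# Hypothetical 12-cell sample; labels "A", "B", "C" stand in for coverage types.
human_calls = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A", "B", "C"]
llm_calls   = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "B", "C", "A"]

rate = percent_agreement(human_calls, llm_calls)
print(f"{rate:.0%}")  # 9 of 12 calls match, so this prints 75%
```

Raw percent agreement does not correct for agreement that would occur by chance, so a rate computed this way is best read alongside the sample size and the number of classifiers.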
Event history
- Complete
Q2 2026
2026-05-28
First formal event. 12-cell stratified sample. 1 human + 1 LLM classifier. 75% inter-rater agreement on type. Process shake-out before editorial board recruitment.
Full record: /wiki/coverage-games/2026-q2
- Gated by board recruitment
Q3 2026
Scheduled September 2026
Target ≥50 cells with ≥3 named human classifiers. Gated by editorial-board recruitment (currently 1 of 6 slots filled).
- Gated by board recruitment
Q4 2026
Scheduled December 2026
Target ≥100 cells with the editorial board fully formed. First event whose result is expected to be peer-review defensible.
Protocol summary
- Sampling: stratified sample of the coverage matrix (mix of high-confidence, medium-confidence, and untagged cells; spans multiple topic kinds and jurisdictions).
- Blind classification: each classifier reads the primary source and produces a type + confidence call before seeing the existing catalog state.
- Calibration: blind calls are compared against the catalog. Disagreements get a written resolution rationale.
- Public record: the full event (participants, COI disclosures, blind calls, resolutions, metrics, limitations) is published as a wiki article (e.g., /wiki/coverage-games/2026-q2).
- Follow-ups: the catalog is updated based on resolutions; rubric ambiguities surfaced by disagreements are added to a forthcoming rubric document.
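The sampling and calibration steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual tooling: the cell fields (`id`, `confidence`), the function names, and the type labels are all hypothetical, and stratification is done on confidence level only, whereas the protocol also spans topic kinds and jurisdictions.

```python
import random
from collections import defaultdict


def stratified_sample(cells, per_stratum, seed=0):
    """Draw up to `per_stratum` cells from each confidence stratum."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    by_stratum = defaultdict(list)
    for cell in cells:
        by_stratum[cell["confidence"]].append(cell)
    sample = []
    for stratum in sorted(by_stratum):
        pool = by_stratum[stratum]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


def calibrate(blind_calls, catalog):
    """List cells whose blind type call differs from the catalog's type."""
    return [
        {"cell_id": cell_id, "blind": call, "catalog": catalog.get(cell_id)}
        for cell_id, call in blind_calls.items()
        if catalog.get(cell_id) != call
    ]


# Hypothetical coverage matrix: 30 cells, 10 per confidence stratum.
example_cells = [
    {"id": f"cell-{i}", "confidence": level}
    for i, level in enumerate(["high", "medium", "untagged"] * 10)
]
sample = stratified_sample(example_cells, per_stratum=4, seed=1)

# Hypothetical blind calls compared against a hypothetical catalog state.
disagreements = calibrate(
    {"cell-0": "statute", "cell-1": "guidance"},
    {"cell-0": "statute", "cell-1": "statute"},
)
```

Each entry returned by `calibrate` corresponds to a disagreement that, under the protocol, would receive a written resolution rationale.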
Full protocol document: docs/coverage-games-process.md in the public repository.