Coverage Games — 2026 Q2
The first formal Coverage Games event, run 2026-05-28 per the protocol at /wiki/coverage-games. This is the public record of the event — the methodology note, the disagreement matrix, the resolution decisions, and the limitations.
1 · Participants
| Editor | Role | Affiliation | Conflict-of-interest |
|---|---|---|---|
| Ryan Wong | Founding editor (catalog seed) | Policy Window project founder | Project founder; no funding from any AI lab or regulator. |
| Claude Sonnet 4.5 | Independent second classifier (asynchronous) | Anthropic | Classifier is an Anthropic frontier-lab model. Cells touching ANTHROPIC-RSP-2024 are recused under the protocol (classified by the founding editor only, with the limitation noted); none fell in this sample (see §5). |
2 · Sample
12 cells, stratified per protocol:
- 3 currently tagged confidence: "high" (calibration cells)
- 1 currently tagged confidence: "medium" (confirmation cell)
- 8 currently untagged (priority backlog)
Spans 5 topic kinds × 9 distinct instruments × 7 distinct topics. Stratification rationale: calibration cells anchor the rubric, the confirmation cell re-tests a prior judgement, and untagged cells advance the confidence-tier backfill.
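The sample composition can be tallied mechanically. The sketch below is illustrative only: the `SAMPLE` list is transcribed from the cell keys in section 3 together with each cell's pre-event confidence tag, while the `summarise` helper and its field names are hypothetical and not part of the Coverage Games protocol. Topic kinds are not recorded on this page, so they are not counted.

```python
from collections import Counter

# (instrument, topic, pre-event confidence tag); None means untagged or omitted.
SAMPLE = [
    ("EU-AIA-2024", "foundation_models", "high"),
    ("EU-AIA-2024", "biometric_id", "high"),
    ("EU-AIA-2024", "deepfakes", "high"),
    ("EU-AIA-2024", "training_data", "medium"),
    ("US-EO-14110", "foundation_models", None),
    ("US-EO-14179", "foundation_models", None),
    ("CN-GENAI-2023", "deepfakes", None),
    ("G7-HIROSHIMA", "transparency", None),
    ("BR-AIBILL-2024", "foundation_models", None),
    ("META-FRONTIER-2024", "open_weight_release", None),
    ("OECD-AI-PRIN", "foundation_models", None),
    ("UK-WHITEPAPER-2023", "redress", None),
]


def summarise(sample):
    """Count cells per stratum plus distinct instruments and topics."""
    strata = Counter(tag or "untagged" for _, _, tag in sample)
    return {
        "cells": len(sample),
        "strata": dict(strata),
        "instruments": len({instrument for instrument, _, _ in sample}),
        "topics": len({topic for _, topic, _ in sample}),
    }


summary = summarise(SAMPLE)
print(summary)
```

Re-running the tally whenever the sample changes keeps the stratum counts and the distinct-instrument and distinct-topic figures in step with the round list.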
3 · Round-by-round results
The second classifier recorded its blind classifications before it was shown the existing catalog state. Each round lists the blind result and the catalog result as type / confidence; disagreements are shown with the editor's resolution rationale.
#1 · EU-AIA-2024:foundation_models
Agreement
Blind: governs / high
Catalog: governs / high
Calibration cell.
#2 · EU-AIA-2024:biometric_id
Agreement
Blind: governs / high
Catalog: governs / high
Calibration cell. Art. 5(1)(h) operative prohibition.
#3 · EU-AIA-2024:deepfakes
Agreement
Blind: governs / high
Catalog: governs / high
Calibration cell. Art. 50(4) operative disclosure.
#4 · EU-AIA-2024:training_data
Agreement
Blind: implicit / medium
Catalog: implicit / medium
Confirmation cell. Recital 105 + CDSM Directive overlap.
#5 · US-EO-14110:foundation_models
Agreement
Blind: governs / high
Catalog: governs / (upgraded to high)
§4.2(a) DPA reporting authority. First-time confidence tag.
#6 · US-EO-14179:foundation_models
Partial
Blind: silent / high
Catalog: (omitted) / —
Cell omission: catalog had no EO 14179 × foundation_models row. Backfilled post-event with silent / high.
#7 · CN-GENAI-2023:deepfakes
Agreement
Blind: governs / high
Catalog: governs / (upgraded to high)
Art. 12 + Deep Synthesis Rules (2022) tagging mandate.
#8 · G7-HIROSHIMA:transparency
Disagreement
Blind: implicit / medium
Catalog: governs / (tagged medium)
Type disagreement. Blind classifier read voluntary-code as implicit; catalog editor reads §2 commitment as quasi-binding under signatory adherence. Catalog wins; medium confidence applied.
#9 · BR-AIBILL-2024:foundation_models
Disagreement
Blind: implicit / medium
Catalog: governs / (tagged low)
Type disagreement. Blind classifier flagged PL 2338 not yet enacted; catalog editor reads Arts. 17-19 as operative-once-enacted. Catalog wins with conditional caveat; low confidence applied.
#10 · META-FRONTIER-2024:open_weight_release
Agreement
Blind: governs / high
Catalog: governs / (upgraded to high)
Pre-release threat modelling + halt-training operative provision.
#11 · OECD-AI-PRIN:foundation_models
Agreement
Blind: implicit / low
Catalog: implicit / (tagged low)
Pre-GPAI vocabulary; covered via Principle 1.2. Alternative literal-naming rubric would classify silent.
#12 · UK-WHITEPAPER-2023:redress
Disagreement
Blind: silent / medium
Catalog: implicit / (tagged medium)
Type disagreement. Blind classifier missed Principle 5 (contestability + redress); catalog editor cited it correctly. Catalog wins; medium confidence (principle-level not operative).
4 · Headline metrics
| Metric | Value |
|---|---|
| Editorial agreement rate (type) | 8 / 12 = 67% |
| Editorial agreement rate (confidence, pre-tagged cells: 3 calibration + 1 confirmation) | 4 / 4 = 100% |
| Type disagreements | 3, all resolved in favour of the catalog editor; the blind classifier read voluntary-code commitments more strictly than the operating editorial rubric does. |
| Cell omissions discovered | 1, backfilled post-event. |
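The headline counts can be reproduced from the round-by-round record. The following is a minimal sketch, not the project's tooling: `ROUNDS` is transcribed from the type classifications in section 3, and the `outcome` helper is hypothetical. It assumes the omitted cell (#6, where the catalog had no row) is counted as its own category and kept in the denominator, which is the reading that matches the published 8 / 12.

```python
from collections import Counter

# (round number, blind type, catalog type); None means the catalog had no row.
ROUNDS = [
    (1, "governs", "governs"),
    (2, "governs", "governs"),
    (3, "governs", "governs"),
    (4, "implicit", "implicit"),
    (5, "governs", "governs"),
    (6, "silent", None),
    (7, "governs", "governs"),
    (8, "implicit", "governs"),
    (9, "implicit", "governs"),
    (10, "governs", "governs"),
    (11, "implicit", "implicit"),
    (12, "silent", "implicit"),
]


def outcome(blind, catalog):
    """Label one round by comparing coverage types only (not confidence)."""
    if catalog is None:
        return "omission"
    return "agreement" if blind == catalog else "disagreement"


tally = Counter(outcome(blind, catalog) for _, blind, catalog in ROUNDS)
rate = tally["agreement"] / len(ROUNDS)
print(f"type agreement: {tally['agreement']} / {len(ROUNDS)} = {rate:.0%}")
print(f"disagreements: {tally['disagreement']}, omissions: {tally['omission']}")
```

Excluding the omitted cell from the denominator instead would give 8 / 11, so the choice of denominator is worth stating alongside the rate.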
5 · Limitations + caveats
- Small sample. 12 cells is 1.9% of the 624-cell coverage matrix. Generalisation to the full matrix is provisional.
- One human classifier. The founding editor wrote the catalog and the rubric, so the 1-human-vs-1-LLM agreement rate is structurally biased toward catalog reproduction, not toward independent verification.
- Rubric flex on edge cases. The 3 disagreements (G7-HIROSHIMA, BR-AIBILL-2024, UK-WHITEPAPER-2023) all turn on rubric-interpretation edges: voluntary vs binding, principle vs operative, enacted vs pending. The rubric needs formal decision rules before Q3 2026 (see §6).
- LLM second classifier. The 67% type-agreement rate (8 / 12, see §4) measures a human classifier against an LLM trained on similar primary sources; it is materially weaker evidence than agreement between independent humans. We disclose this per /wiki/ai-disclosure §3.
- Recused cells. ANTHROPIC-RSP-2024 cells were not in this sample but would have been recused if included (LLM classifier from Anthropic).
6 · Follow-ups for Q3 2026
- Recruit ≥2 additional named human classifiers (gated by editorial-board recruitment).
- Formalise the classification rubric with explicit decision rules for the 3 edge cases (voluntary-vs-binding, principle-vs-operative, enacted-vs-pending). Publish at /wiki/coverage-games-rubric (forthcoming).
- Expand sample to ≥50 cells with stratification by topic kind, jurisdiction, and confidence-tag status.
- Re-test the 3 disputed cells with the formalised rubric to establish whether the disagreement was rubric-ambiguity (resolved by clearer rules) or substantive (persists under clearer rules).
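One way the expanded, stratified sample could be drawn is sketched below. Everything in it is an assumption for illustration: the `stratified_sample` helper, the cell field names (`topic_kind`, `jurisdiction`, `confidence`), the round-robin allocation, and the synthetic matrix are hypothetical and are not taken from the protocol at /wiki/coverage-games.

```python
import random
from collections import defaultdict


def stratified_sample(cells, key, target, seed=0):
    """Draw up to `target` cells, taking one per stratum in turn (round-robin)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for cell in cells:
        strata[key(cell)].append(cell)
    # Shuffle inside each stratum so picks are random but reproducible.
    for members in strata.values():
        rng.shuffle(members)
    queues = [strata[k] for k in sorted(strata)]
    picked = []
    # Keep cycling until the target is met or every stratum is exhausted.
    while len(picked) < target and any(queues):
        for queue in queues:
            if queue and len(picked) < target:
                picked.append(queue.pop())
    return picked


def stratum(cell):
    """Stratification key: topic kind, jurisdiction, confidence-tag status."""
    return (cell["topic_kind"], cell["jurisdiction"], cell["confidence"])


# Synthetic placeholder matrix; real cells would come from the catalog.
KINDS = ["kind_a", "kind_b", "kind_c"]
JURISDICTIONS = ["J1", "J2", "J3", "J4"]
TAGS = ["high", "medium", "low", "untagged"]
matrix = [
    {
        "id": f"INSTRUMENT-{i}:topic_{j}",
        "topic_kind": KINDS[j % 3],
        "jurisdiction": JURISDICTIONS[i % 4],
        "confidence": TAGS[(i + j) % 4],
    }
    for i in range(12)
    for j in range(8)
]

sample = stratified_sample(matrix, key=stratum, target=50, seed=2026)
print(len(sample), "cells drawn across", len({stratum(c) for c in sample}), "strata")
```

Fixing the seed keeps the draw reproducible for the public record. Round-robin allocation guarantees every stratum is represented when the target is at least the number of strata; proportional allocation would be an alternative if strata differ greatly in size.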