Coverage Games 2026 Q2 — process documentation, not a citable reliability estimate

The first formal Coverage Games event, run 2026-05-28 per the protocol at /wiki/coverage-games. This is the public record of the event — the methodology note, the disagreement matrix, the resolution decisions, and the limitations.

Read this first: not a reliability estimate. The event used 1 human (the founding editor) and 1 LLM classifier (Claude Sonnet 4.5). It is a process shake-out, not a peer-review-grade inter-rater reliability estimate; the protocol target is 3–5 independent expert classifiers. Iter-312 (wave-8 newsroom-standards finding G3) made this caveat the headline rather than the footnote: any number reported on this page is a methodology-development artefact, not a figure a researcher should cite as the catalog's reliability metric. The next event (Q3 2026, gated by editorial-board recruitment per /wiki/editorial-board) targets ≥3 named human classifiers and ≥50 cells; only that run will produce a citable reliability figure.

1 · Participants

Editor	Role	Affiliation	Conflict-of-interest
Ryan Wong	Founding editor (catalog seed)	Policy Window project founder	Project founder; no funding from any AI lab or regulator.
Claude Sonnet 4.5	Independent second classifier (asynchronous)	Anthropic	Anthropic frontier-lab AI used as classifier; cells touching ANTHROPIC-RSP-2024 are recused (only the founding editor classifies those, with the limitation noted). No such cells were in this sample (see §5).

2 · Sample

12 cells, stratified per protocol:

3 currently tagged confidence: "high" (calibration cells)
1 currently tagged confidence: "medium" (confirmation cell)
8 currently untagged (priority backlog)

Spans 5 topic kinds × 9 distinct instruments × 7 distinct topics (instrument and topic counts taken from the cell identifiers in §3). Stratification rationale: calibration cells anchor the rubric, the confirmation cell re-tests a prior judgement, and untagged cells advance the confidence-tier backfill.
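The sample description can be cross-checked against the cell identifiers listed in §3. The following is a minimal sketch, not part of the event protocol: the names `CELLS`, `STRATA`, and `split_cell` are illustrative assumptions, and the identifiers are simply transcribed from this page. Topic kinds cannot be derived from the identifiers alone, so they are not tallied here.

```python
# Cell identifiers transcribed from the round-by-round results (section 3).
CELLS = [
    "EU-AIA-2024:foundation_models",
    "EU-AIA-2024:biometric_id",
    "EU-AIA-2024:deepfakes",
    "EU-AIA-2024:training_data",
    "US-EO-14110:foundation_models",
    "US-EO-14179:foundation_models",
    "CN-GENAI-2023:deepfakes",
    "G7-HIROSHIMA:transparency",
    "BR-AIBILL-2024:foundation_models",
    "META-FRONTIER-2024:open_weight_release",
    "OECD-AI-PRIN:foundation_models",
    "UK-WHITEPAPER-2023:redress",
]

# Stratum sizes as stated in this section (hypothetical variable name).
STRATA = {"high": 3, "medium": 1, "untagged": 8}


def split_cell(cell_id: str) -> tuple[str, str]:
    """Split 'INSTRUMENT:topic' into its two parts."""
    instrument, topic = cell_id.split(":", 1)
    return instrument, topic


instruments = {split_cell(c)[0] for c in CELLS}
topics = {split_cell(c)[1] for c in CELLS}

# The strata must cover the sample exactly.
assert sum(STRATA.values()) == len(CELLS)
print(f"cells={len(CELLS)} instruments={len(instruments)} topics={len(topics)}")
```

Running the tally is a cheap way to confirm that the stratum sizes and the distinct-instrument and distinct-topic counts match the cells actually classified.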

3 · Round-by-round results

The second classifier recorded its blind classifications before it was shown the existing catalog state. Each disagreement is listed with the editor's resolution rationale.

#1 · EU-AIA-2024:foundation_models

Agreement

Blind: governs / high

Catalog: governs / high

Calibration cell.

#2 · EU-AIA-2024:biometric_id

Agreement

Blind: governs / high

Catalog: governs / high

Calibration cell. Art. 5(1)(h) operative prohibition.

#3 · EU-AIA-2024:deepfakes

Agreement

Blind: governs / high

Catalog: governs / high

Calibration cell. Art. 50(4) operative disclosure.

#4 · EU-AIA-2024:training_data

Agreement

Blind: implicit / medium

Catalog: implicit / medium

Confirmation cell. Recital 105 + CDSM Directive overlap.

#5 · US-EO-14110:foundation_models

Agreement

Blind: governs / high

Catalog: governs / (upgraded to high)

§4.2(a) DPA reporting authority. First-time confidence tag.

#6 · US-EO-14179:foundation_models

Partial

Blind: silent / high

Catalog: (omitted) / —

Cell omission: catalog had no EO 14179 × foundation_models row. Backfilled post-event with silent / high.

#7 · CN-GENAI-2023:deepfakes

Agreement

Blind: governs / high

Catalog: governs / (upgraded to high)

Art. 12 + Deep Synthesis Rules (2022) tagging mandate.

#8 · G7-HIROSHIMA:transparency

Disagreement

Blind: implicit / medium

Catalog: governs / (tagged medium)

Type disagreement. Blind classifier read voluntary-code as implicit; catalog editor reads §2 commitment as quasi-binding under signatory adherence. Catalog wins; medium confidence applied.

#9 · BR-AIBILL-2024:foundation_models

Disagreement

Blind: implicit / medium

Catalog: governs / (tagged low)

Type disagreement. Blind classifier flagged PL 2338 not yet enacted; catalog editor reads Arts. 17-19 as operative-once-enacted. Catalog wins with conditional caveat; low confidence applied.

#10 · META-FRONTIER-2024:open_weight_release

Agreement

Blind: governs / high

Catalog: governs / (upgraded to high)

Pre-release threat modelling + halt-training operative provision.

#11 · OECD-AI-PRIN:foundation_models

Agreement

Blind: implicit / low

Catalog: implicit / (tagged low)

Pre-GPAI vocabulary; covered via Principle 1.2. Alternative literal-naming rubric would classify silent.

#12 · UK-WHITEPAPER-2023:redress

Disagreement

Blind: silent / medium

Catalog: implicit / (tagged medium)

Type disagreement. Blind classifier missed Principle 5 (contestability + redress); catalog editor cited it correctly. Catalog wins; medium confidence (principle-level not operative).

4 · Headline metrics

Metric	Value
Editorial agreement rate (type)	8 / 12 = 67% (the cell omission, #6, is not counted as an agreement)
Editorial agreement rate (confidence, pre-tagged cells: 3 calibration + 1 confirmation)	4 / 4 = 100%
Type disagreements	3, all resolved in favour of the catalog editor; the blind classifier read voluntary-code commitments more strictly than the operating editorial rubric does.
Cell omissions discovered	1, backfilled post-event.
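The headline figures can be reproduced from the per-cell outcomes in §3. This is a minimal sketch under stated assumptions: `OUTCOMES` and `PRETAGGED_CONFIDENCE` are illustrative names, the outcome labels are transcribed from this page, and the "partial" outcome (#6) is treated as a non-agreement, which is the reading that yields 8 / 12.

```python
from collections import Counter

# Outcome per cell number, transcribed from section 3.
OUTCOMES = {
    1: "agreement", 2: "agreement", 3: "agreement", 4: "agreement",
    5: "agreement", 6: "partial", 7: "agreement", 8: "disagreement",
    9: "disagreement", 10: "agreement", 11: "agreement", 12: "disagreement",
}

# (blind confidence, catalog confidence) for the four cells that already
# carried a confidence tag before the event (#1 to #4).
PRETAGGED_CONFIDENCE = {
    1: ("high", "high"),
    2: ("high", "high"),
    3: ("high", "high"),
    4: ("medium", "medium"),
}

counts = Counter(OUTCOMES.values())
type_rate = counts["agreement"] / len(OUTCOMES)

conf_matches = sum(blind == catalog for blind, catalog in PRETAGGED_CONFIDENCE.values())
conf_rate = conf_matches / len(PRETAGGED_CONFIDENCE)

print(f"type agreement: {counts['agreement']} / {len(OUTCOMES)} = {type_rate:.0%}")
print(f"confidence agreement: {conf_matches} / {len(PRETAGGED_CONFIDENCE)} = {conf_rate:.0%}")
print(f"disagreements: {counts['disagreement']}, omissions: {counts['partial']}")
```

These are raw agreement rates only. The sketch deliberately computes no chance-corrected statistic, since this page does not report one.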

5 · Limitations + caveats

Small sample. 12 cells is 1.9% of the 624-cell coverage matrix. Generalisation to the full matrix is provisional.
One human classifier. The founding editor wrote the catalog and the rubric, so the 1-human-vs-1-LLM agreement rate is structurally biased toward catalog reproduction, not toward independent verification.
Rubric flex on edge cases. The 3 disagreements (G7-HIROSHIMA, BR-AIBILL-2024, UK-WHITEPAPER-2023) all turn on rubric-interpretation edges: voluntary vs binding, enacted vs pending, and principle vs operative, respectively. The rubric needs formal decision rules before Q3 2026 (see §6).
LLM second classifier. The 67% type agreement (8 / 12) measures a human classifier against an LLM trained on similar primary sources; it is materially weaker evidence than agreement between independent humans. This is disclosed per /wiki/ai-disclosure §3.
Recused cells. ANTHROPIC-RSP-2024 cells were not in this sample but would have been recused if included (LLM classifier from Anthropic).
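To make the small-sample caveat concrete, the sketch below computes the sampled fraction of the matrix and, purely as an illustration, a 95% Wilson score interval around the 8-of-12 type agreement. The Wilson interval is an assumption introduced here and is not part of the event protocol; the function name `wilson_interval` is hypothetical. The interval also does nothing to correct the structural biases listed above.

```python
import math

MATRIX_CELLS = 624   # size of the coverage matrix, from this section
SAMPLED_CELLS = 12   # cells classified in this event
AGREEMENTS = 8       # type agreements, from the headline metrics


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


fraction = SAMPLED_CELLS / MATRIX_CELLS
low, high = wilson_interval(AGREEMENTS, SAMPLED_CELLS)

print(f"sampled fraction: {fraction:.1%}")
print(f"illustrative 95% interval for type agreement: {low:.2f} to {high:.2f}")
```

With only 12 cells the interval is very wide (roughly 0.39 to 0.86), which is consistent with treating the headline rate as a process artefact rather than an estimate.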

6 · Follow-ups for Q3 2026

Recruit ≥2 additional named human classifiers (gated by editorial-board recruitment).
Formalise the classification rubric with explicit decision rules for the 3 edge cases (voluntary-vs-binding, principle-vs-operative, enacted-vs-pending). Publish at /wiki/coverage-games-rubric (forthcoming).
Expand sample to ≥50 cells with stratification by topic kind, jurisdiction, and confidence-tag status.
Re-test the 3 disputed cells with the formalised rubric to establish whether the disagreement was rubric-ambiguity (resolved by clearer rules) or substantive (persists under clearer rules).
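The re-test in the last follow-up reduces to a simple rule: if the classifiers converge on one type under the formalised rubric, the original disagreement was rubric-ambiguity; if they still differ, it was substantive. A minimal sketch of that rule follows. The `Retest` record, the `diagnose` function, and the example labels are hypothetical placeholders for illustration, not results or planned tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retest:
    """One disputed cell re-classified under the formalised rubric."""
    cell: str
    types: tuple[str, ...]  # one type label per classifier, e.g. "governs"


def diagnose(retest: Retest) -> str:
    """Label the original disagreement according to the re-test outcome."""
    if len(retest.types) < 2:
        raise ValueError("need at least two classifications to compare")
    # All classifiers agree under the clearer rules: ambiguity was the cause.
    if len(set(retest.types)) == 1:
        return "rubric-ambiguity"
    # Disagreement persists under the clearer rules.
    return "substantive"


# Hypothetical inputs only; these labels are NOT outcomes of any event.
examples = [
    Retest("G7-HIROSHIMA:transparency", ("implicit", "implicit", "implicit")),
    Retest("BR-AIBILL-2024:foundation_models", ("governs", "implicit", "governs")),
]
for r in examples:
    print(r.cell, "->", diagnose(r))
```

Requiring full convergence is the strictest reading of "resolved by clearer rules"; a majority threshold would be an alternative design, but it is not something this page specifies.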
