Calibration
Critical AI against the human standard
A critique journal should be able to show how its critiques compare to the established human standard — not just assert quality. The benchmark corpus of real published Comments, replications and reanalyses defines that standard: which analytical lenses expert critiques emphasise, and how broad they tend to be. Each Critical AI critique is then scored on the same dimension vocabulary — dimensional alignment, breadth, the credibility gates the benchmarks embody, and claim-grounding. Every number re-derives in-app from the corpus and the critiques.
Calibration tracks access basis, by design. A critique read at full open-access text can reach the methods/statistics/identification lenses experts emphasise, and so can be calibrated. An abstract-only critique engages only framing and stated claims, so it typically scores lower and reads needs review. That is not a defect but an honest signal that full-text review is needed to reach the standard. The metric thus validates the journal’s access-basis severity caps.
The human standard
Share of the 39 benchmark critiques that exercise each lens. The most-emphasised lenses (methods, statistics, claim–evidence fit, reproducibility) define what a strong critique attends to; alignment rewards a critique for covering them. A typical published critique spans 5–5.5 dimensions (median 5, range 3–6).
- Methods / design: 95%
- Claim–evidence fit: 90%
- Statistics / inference: 85%
- Reproducibility: 67%
- Causal identification: 44%
- Generalisation: 38%
- Overclaiming: 36%
- Data & code: 28%
- Theory / framing: 18%
- Novelty / contribution: 8%
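As a rough illustration of how such shares and the breadth summary could be derived, the sketch below computes them from a list of benchmark critiques, each tagged with the lenses it exercises. The function names and the four-item toy corpus are hypothetical; the real benchmark corpus and its tagging are not reproduced here.

```python
from collections import Counter
from statistics import median

# The ten lenses named in the list above.
LENSES = [
    "Methods / design", "Claim–evidence fit", "Statistics / inference",
    "Reproducibility", "Causal identification", "Generalisation",
    "Overclaiming", "Data & code", "Theory / framing",
    "Novelty / contribution",
]

def emphasis_profile(benchmarks):
    """Share of benchmark critiques that exercise each lens."""
    # set() guards against a lens being tagged twice on one critique.
    counts = Counter(lens for tags in benchmarks for lens in set(tags))
    n = len(benchmarks)
    return {lens: counts[lens] / n for lens in LENSES}

def breadth_summary(benchmarks):
    """Median and range of the number of lenses per critique."""
    sizes = [len(set(tags)) for tags in benchmarks]
    return {"median": median(sizes), "min": min(sizes), "max": max(sizes)}

# Hypothetical toy corpus (NOT the 39 real benchmarks).
toy_corpus = [
    ["Methods / design", "Statistics / inference", "Reproducibility"],
    ["Methods / design", "Claim–evidence fit", "Overclaiming", "Generalisation"],
    ["Methods / design", "Claim–evidence fit", "Statistics / inference"],
    ["Claim–evidence fit", "Causal identification", "Data & code"],
]

profile = emphasis_profile(toy_corpus)
breadth = breadth_summary(toy_corpus)
print(profile["Methods / design"])  # 3 of 4 toy critiques
print(breadth)
```

Because the profile is a plain share per lens, adding one benchmark to the corpus shifts every share, which is consistent with the statement that the standard moves when a benchmark is added.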
The standard differs by field
Different social sciences critique differently, so a critique is also scored against the standard of its own field — not just the pooled average. Each domain below has its own emphasis profile, derived from its benchmark critiques (domains with fewer than 3 benchmarks fall back to the pooled standard).
| Domain | n | Most-emphasised lenses |
|---|---|---|
| Political science | 8 | Statistics / inference 100% · Methods / design 88% · Reproducibility 88% · Claim–evidence fit 75% |
| Economics & finance | 6 | Methods / design 100% · Claim–evidence fit 100% · Statistics / inference 83% · Reproducibility 83% |
| Sociology | 6 | Methods / design 100% · Claim–evidence fit 100% · Statistics / inference 67% · Reproducibility 67% |
| Psychology | 5 | Methods / design 100% · Statistics / inference 100% · Claim–evidence fit 100% · Reproducibility 100% |
| Public policy & criminology | 5 | Methods / design 100% · Claim–evidence fit 100% · Statistics / inference 80% · Overclaiming 40% |
| Management, IS & marketing | 3 | Methods / design 100% · Causal identification 100% · Statistics / inference 100% · Reproducibility 67% |
| Communication & media | 3 | Methods / design 100% · Claim–evidence fit 100% · Statistics / inference 67% · Reproducibility 67% |
| Education | 3 | Claim–evidence fit 100% · Methods / design 67% · Causal identification 67% · Statistics / inference 67% |
The contrast is real: management critiques lean on causal identification, education on generalisation, and psychology and political science on statistics and reproducibility (the replication tradition).
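The fallback rule described in this section (use the field's own standard when it has at least 3 benchmarks, otherwise the pooled one) can be sketched as follows. This is a minimal illustration, not the journal's implementation: `reference_profile` and the constant name are hypothetical, the profiles are truncated to the top-four lenses shown in the table, and "Anthropology" is an invented field used only to trigger the fallback.

```python
MIN_FIELD_BENCHMARKS = 3  # threshold stated in the text above

# Benchmark counts and (partial) profiles copied from the table above.
FIELD_COUNTS = {"Political science": 8, "Psychology": 5}
FIELD_PROFILES = {
    "Political science": {
        "Statistics / inference": 1.00, "Methods / design": 0.88,
        "Reproducibility": 0.88, "Claim–evidence fit": 0.75,
    },
    "Psychology": {
        "Methods / design": 1.00, "Statistics / inference": 1.00,
        "Claim–evidence fit": 1.00, "Reproducibility": 1.00,
    },
}
# Pooled standard, truncated to its four most-emphasised lenses.
POOLED = {
    "Methods / design": 0.95, "Claim–evidence fit": 0.90,
    "Statistics / inference": 0.85, "Reproducibility": 0.67,
}

def reference_profile(field, counts, field_profiles, pooled):
    """Return the field's own profile, or the pooled one as a fallback."""
    enough = counts.get(field, 0) >= MIN_FIELD_BENCHMARKS
    if enough and field in field_profiles:
        return field_profiles[field]
    return pooled

in_field = reference_profile("Political science", FIELD_COUNTS, FIELD_PROFILES, POOLED)
fallback = reference_profile("Anthropology", FIELD_COUNTS, FIELD_PROFILES, POOLED)
print(in_field["Statistics / inference"])
print(fallback is POOLED)
```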
Critical AI critiques, scored
- ✓ calibrated · Open-access full text · CRIT-000013
  The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
  - Alignment: 0.92 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 7 · Comprehensive (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Summary: attends to the expert-emphasised lenses (alignment 0.92); broader than a typical Comment (7 dimensions vs human median 5).
- ✓ calibrated · Open-access full text · CRIT-000014
  Scaffolding Human–AI Collaboration: A Field Experiment on Behavioral Protocols and Cognitive Reframing
  - Alignment: 0.92 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 6 · Comprehensive (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Summary: attends to the expert-emphasised lenses (alignment 0.92); broader than a typical Comment (6 dimensions vs human median 5).
- ✓ calibrated · Open-access full text · CRIT-000002
  Generative AI at Work
  - Alignment: 0.92 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 8 · Comprehensive (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Economics & finance. Alignment against the field's own critique standard: 0.91 (pooled 0.92)
  - Summary: attends to the expert-emphasised lenses (alignment 0.92); broader than a typical Comment (8 dimensions vs human median 5).
- ✓ calibrated · Abstract only · CRIT-000009
  The politics of artificial intelligence alignment: Public reactions to AI moderation in the case of Google’s Gemini
  - Alignment: 0.83 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Communication & media. Alignment against the field's own critique standard: 0.88 (pooled 0.83)
  - Summary: attends to the expert-emphasised lenses (alignment 0.83); more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000012
  Generative AI, propaganda, and digital authoritarianism: Comparative insights from six democratically weakened countries
  - Alignment: 0.78 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Economics & finance. Alignment against the field's own critique standard: 0.72 (pooled 0.78)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.78 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000010
  Refusal as silence: Gendered disparities in Vision-Language Model responses
  - Alignment: 0.78 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Communication & media. Alignment against the field's own critique standard: 0.88 (pooled 0.78)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.78 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000008
  AI meets politics: Examining the effects of different targeting strategies across 15 countries
  - Alignment: 0.72 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Economics & finance. Alignment against the field's own critique standard: 0.68 (pooled 0.72)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.72 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000003
  The Cybernetic Teammate: A Field Experiment on Generative AI and Teamwork
  - Alignment: 0.70 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Management, IS & marketing. Alignment against the field's own critique standard: 0.43 (pooled 0.70)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.70 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000006
  Can ChatGPT Kill User-Generated Q&A Platforms?
  - Alignment: 0.53 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 3 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Management, IS & marketing. Alignment against the field's own critique standard: 0.50 (pooled 0.53)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.53 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (3 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000004
  Artificial Collusion: Examining Supracompetitive Pricing by Q-Learning Algorithms
  - Alignment: 0.49 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Management, IS & marketing. Alignment against the field's own critique standard: 0.17 (pooled 0.49)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.49 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000011
  From rule of law to rule of algorithm: Generative Artificial Intelligence's threat to democracy
  - Alignment: 0.49 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Economics & finance. Alignment against the field's own critique standard: 0.32 (pooled 0.49)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.49 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000005
  Unraveling Generative AI from a Human Intelligence Perspective: A Battery of Experiments
  - Alignment: 0.49 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 4 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Management, IS & marketing. Alignment against the field's own critique standard: 0.17 (pooled 0.49)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.49 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (4 dimensions vs human median 5).
- needs review · Abstract only · CRIT-000007
  Made With AI: Consumer Engagement with Social Media Containing AI Disclosures
  - Alignment: 0.45 (cosine vs the emphasis profile; threshold ≥0.80)
  - Breadth: 3 · Focused (vs human median 5)
  - Discipline: pass (sourced · severity-capped · no motive)
  - Grounding: 100% (claims citing specific evidence)
  - Field: Management, IS & marketing. Alignment against the field's own critique standard: 0.20 (pooled 0.45)
  - Summary: scope-limited by abstract-only access: cannot reach the methods/statistics/identification lenses experts emphasise (alignment 0.45 < 0.8); full-text review needed to reach the calibrated standard; more focused than a typical critique (3 dimensions vs human median 5).
What this measures — and what it doesn’t
Calibration measures whether a critique looks like a member of the expert-critique distribution: it attends to the lenses experts emphasise (alignment), its breadth is in the human range, it passes the same credibility gates the benchmarks embody (sourced, severity-capped to its access basis, claims-not-motives), and every claim cites specific evidence. It does not certify that any individual judgement is correct. Only an expert reading the target paper can do that, which is why critiques remain author-contestable and the severity caps stay conservative. A “comprehensive” breadth is not a defect: Critical AI critiques tend toward a full referee-style appraisal rather than a single-issue Comment.
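To make the alignment score concrete, here is a minimal sketch of a cosine similarity between a critique and the pooled emphasis profile. It assumes, and this page does not confirm, that a critique is represented as a 0/1 vector over the ten lenses. The `verdict` rule (alignment at or above 0.80, discipline gates passed, all claims grounded) is likewise an assumption pieced together from the labels on this page, and the abstract-only lens set is hypothetical.

```python
import math

# Pooled emphasis profile, copied from "The human standard" above.
POOLED = {
    "Methods / design": 0.95, "Claim–evidence fit": 0.90,
    "Statistics / inference": 0.85, "Reproducibility": 0.67,
    "Causal identification": 0.44, "Generalisation": 0.38,
    "Overclaiming": 0.36, "Data & code": 0.28,
    "Theory / framing": 0.18, "Novelty / contribution": 0.08,
}
ALIGNMENT_THRESHOLD = 0.80  # the ">=0.80" shown on each card

def alignment(covered, profile):
    """Cosine between a 0/1 coverage vector (assumed) and the profile."""
    hits = [share for lens, share in profile.items() if lens in covered]
    norm_cov = math.sqrt(len(hits))
    norm_prof = math.sqrt(sum(s * s for s in profile.values()))
    if norm_cov == 0 or norm_prof == 0:
        return 0.0
    return sum(hits) / (norm_cov * norm_prof)

def verdict(covered, profile, discipline_pass, grounding):
    """Assumed rule: all three checks must hold to be 'calibrated'."""
    ok = (alignment(covered, profile) >= ALIGNMENT_THRESHOLD
          and discipline_pass and grounding == 1.0)
    return "calibrated" if ok else "needs review"

# A critique covering the seven most-emphasised lenses.
top_seven = set(list(POOLED)[:7])
a_full = alignment(top_seven, POOLED)

# Hypothetical abstract-only critique covering four lenses.
abstract_only = {"Methods / design", "Claim–evidence fit",
                 "Reproducibility", "Generalisation"}
a_abs = alignment(abstract_only, POOLED)

print(round(a_full, 2), verdict(top_seven, POOLED, True, 1.0))
print(round(a_abs, 2), verdict(abstract_only, POOLED, True, 1.0))
```

Under this representation, covering many lenses does not by itself lower the score much, while missing the heavily weighted ones (methods, statistics) does, which matches the pattern of the abstract-only results listed above.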
The reference distribution is re-derived from the benchmark corpus: add a benchmark and the standard moves. The data is machine-readable at /critique/api/calibration.