Comment + Reply exchange
Against the published exchange, not just the Comment
Post-publication critique in a top journal is an adversarial exchange: a Comment is published, the authors Reply, and sometimes a Rejoinder follows. A Comment is only as good as its ability to survive that Reply. Correctness scored the engine against the Comment; this adds the Reply. For real Comment+Reply exchanges, Critical AI critiqued the original paper blind; we then ask whether it surfaced the published Comment’s flaws, and how the authors’ actual published Reply responded to each.
What it found
Reading only abstracts and blind to the published debate, the engine surfaced 67% (4/6) of the abstract-detectable flaws that drove a real Comment, and every surfaced concern was a point the original authors disputed in a published rejoinder. It surfaced the substantive flaws on the CEO-effect and reproducibility exchanges; it did not on GOTV (the get-out-the-vote field experiment).
Audit correction (G54, 2026-06-21): the first run reported a clean 6/6 = 100%. An independent re-decomposition refuted it: the per-exchange denominators were misrecorded, and on GOTV the engine’s two credited matches were topical, not substantive. Imai’s load-bearing flaw is randomization imbalance + matching (full-text-only), which the engine never surfaced; its concerns were about contact and compliance. Those two are reclassified not-surfaced, correcting the headline to 67%. A uniform “100%” was the under-firing tell.
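The correction comes down to a stricter crediting rule: a match counts toward recall only if it hits the Comment’s load-bearing flaw, not merely the same topic. A minimal sketch of that arithmetic follows, assuming a hypothetical record layout (the field names `exchange` and `match` are illustrative, not the engine’s actual schema). It covers only the reclassification of the two GOTV matches, not the misrecorded per-exchange denominators.

```python
# Hypothetical records, one per abstract-detectable flaw, labelled with the
# audited match type. Field names are illustrative assumptions.
flaws = [
    {"exchange": "CEO effect", "match": "substantive"},
    {"exchange": "GOTV", "match": "topical"},
    {"exchange": "GOTV", "match": "topical"},
    {"exchange": "Reproducibility", "match": "substantive"},
    {"exchange": "Reproducibility", "match": "substantive"},
    {"exchange": "Reproducibility", "match": "substantive"},
]


def recall(records, credited):
    """Return (hits, total), counting a flaw as a hit if its match type is credited."""
    hits = sum(1 for r in records if r["match"] in credited)
    return hits, len(records)


# First run: topical matches were credited alongside substantive ones.
first_run = recall(flaws, {"substantive", "topical"})
# After the audit: only substantive matches count.
audited = recall(flaws, {"substantive"})

for label, (hits, total) in [("first run", first_run), ("audited", audited)]:
    print(f"{label}: {hits}/{total} = {hits / total:.0%}")
```

Keeping the match type on each record, rather than a single surfaced/missed flag, leaves the topical-versus-substantive judgment open to re-audit.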
Read honestly: “4/4 rebutted, 0 conceded” is shaped by selecting exchanges that had a published rejoinder (authors who chose to defend), so “rebutted” is near-structural; it shows the concerns are at the live frontier, not that they are wrong. Adding exchanges where authors largely conceded (e.g. the Reinhart-Rogoff spreadsheet error) is the obvious next step. The steelman dimension (did the engine predict the Reply?) returned 3/3 missed but is exploratory and mis-specified: the engine predicted a defense against its own critique, while the actual Reply rebuts a different specific published Comment it never saw. That number is therefore not a clean measure of an engine limitation.
Exchange by exchange
- Strategic management / organization · recall 100% (1/1)
The use of variance decomposition in the investigation of CEO effects: How large must the CEO effect be to rule out chance?
Comment: Quigley & Graffin (Strategic Management Journal, 2017), Comment · Reply: Fitza (Strategic Management Journal, 2017), Rejoinder
- ✓ surfaced · authors rebutted · The chance 'baseline' in Fitza's simulation is mis-specified: he treats a mechanical property of the estimator (random data still producing nonzero R-squared) as evidence that real CEO performance variance is chance, conflating sampling/estimation noise with a substantive randomness-in-performance claim.
- Political science (experimental methods) · recall 0% (0/2)
The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment
Comment: Imai (American Political Science Review, 2005), reanalysis · Reply: Gerber & Green (American Political Science Review, 2005), Rejoinder
- ✗ missed · authors rebutted · The headline contrast across modes (canvassing large, mail slight, phone null) is partly an artifact of imbalance/method rather than a genuine ranking of mobilization technologies, weakening the inferential basis for the contrast.
- ✗ missed · authors rebutted · The substantive thesis that turnout decline is attributable to the decline in face-to-face mobilization is overstated given that the empirical results (including the phone null) do not survive corrected analysis.
- Psychology (metascience) · recall 100% (3/3)
Estimating the reproducibility of psychological science
Comment: Gilbert, King, Pettigrew & Wilson (Science, 2016), Comment · Reply: Anderson et al. / Open Science Collaboration (Science, 2016), Response
- ✓ surfaced · authors rebutted · The replication studies were statistically underpowered: many replications had low power to detect the true effect even when the original effect was real, so a substantial fraction were expected to fail by chance alone. The reported low replication rate is therefore inflated as evidence of irreproducibility.
- ✓ surfaced · authors rebutted · The replicated studies were not a representative or random sample of the literature, and protocols deviated from the originals (different populations, settings, stimuli). With non-random selection and infidelity to original methods, the project gives a biased, non-generalizable estimate of reproducibility, and protocol infidelity depresses the observed rate.
- ✓ surfaced · authors rebutted · The subjective 'endorsement'/replication-success criterion and the other success metrics are misleading and biased toward declaring non-replication; once expected agreement and the uncertainty in original and replication estimates are properly accounted for, the data are consistent with near-ceiling reproducibility rather than the low rate claimed.
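The per-exchange figures and the pooled headline can be tallied directly from the entries above. The sketch below transcribes each flaw as a tuple; the tuple layout and the short exchange labels are assumptions made for illustration, not the format served by the machine-readable endpoint.

```python
from collections import defaultdict

# (exchange, surfaced by the engine?, authors' published response),
# transcribed from the exchange-by-exchange list. Layout is illustrative.
flaws = [
    ("CEO effect", True, "rebutted"),
    ("GOTV", False, "rebutted"),
    ("GOTV", False, "rebutted"),
    ("Reproducibility", True, "rebutted"),
    ("Reproducibility", True, "rebutted"),
    ("Reproducibility", True, "rebutted"),
]

# exchange -> [flaws surfaced, flaws total]
per_exchange = defaultdict(lambda: [0, 0])
for exchange, surfaced, _response in flaws:
    per_exchange[exchange][0] += int(surfaced)
    per_exchange[exchange][1] += 1

for exchange, (hit, total) in per_exchange.items():
    print(f"{exchange}: recall {hit}/{total} = {hit / total:.0%}")

# Pooled recall over all flaws, and the rebuttal count among surfaced flaws.
surfaced_total = sum(hit for hit, _ in per_exchange.values())
flaw_total = sum(total for _, total in per_exchange.values())
rebutted_surfaced = sum(1 for _, s, resp in flaws if s and resp == "rebutted")

print(f"pooled: {surfaced_total}/{flaw_total} = {surfaced_total / flaw_total:.0%}")
print(f"rebutted among surfaced: {rebutted_surfaced}/{surfaced_total}")
```

Pooling over flaws hides the spread: the per-exchange rates are all-or-nothing in this cohort, so the pooled rate is better read as "two of three exchanges" than as a stable per-flaw probability.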
What this proves — and what it doesn’t
It proves the engine, blind and abstract-only, can surface the substantive flaws that drove real top-journal Comments (4 of 6 here, across two of three exchanges), and that those flaws are points serious enough to provoke a published authorial Reply. It does not resolve who is right in each dispute (these are unresolved debates; the authors rebutted). The cohort is small (3 exchanges), the Reply is read as its published summary (no-reproduce policy), and the steelman dimension needs a corrected design before its number means anything. Machine-readable at /critique/api/exchanges. See also the 3-exchange benchmark corpus these exchanges are drawn from.