Comment + Reply exchange

Against the published exchange, not just the Comment

Post-publication critique in a top journal is an adversarial exchange: a Comment is published, the authors Reply, and sometimes a Rejoinder follows. A Comment is only as good as its ability to survive that Reply. Correctness scored the engine against the Comment; this adds the Reply. For real Comment+Reply exchanges, Critical AI critiqued the original paper blind; we then ask whether it surfaced the published Comment’s flaws, and how the authors’ actual published Reply responded to each.

Comment-recall: 67% (4/6 detectable Comment flaws)

Authors rebutted: 4 of 4 surfaced concerns

Authors conceded: 0 on the published record

Exchanges: 3 top-journal Comment+Reply

Run 2026-06-21

What it found

Reading only abstracts and blind to the published debate, the engine surfaced 67% (4/6) of the abstract-detectable flaws that drove the real Comments, and every surfaced concern was a point the original authors disputed in a published rejoinder. It surfaced the substantive flaws on the CEO-effect and reproducibility exchanges; it did not on the GOTV (get-out-the-vote) exchange.

Audit correction (G54, 2026-06-21): the first run reported a clean 6/6 = 100%. An independent re-decomposition refuted it: the per-exchange denominators were misrecorded, and on GOTV the engine’s two credited matches were topical, not substantive. Imai’s load-bearing flaw is randomization imbalance + matching (full-text-only), which the engine never surfaced; its concerns were about contact and compliance. Those two are reclassified not-surfaced, correcting the headline to 67%. A uniform “100%” was the under-firing tell.
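The arithmetic of that correction can be sketched as follows. This is a minimal illustration, not the audit's actual code: the `Match` record, the exchange labels, and the `"substantive"`/`"topical"` labels are hypothetical names introduced here, and the sketch models only the reclassification of the two GOTV matches, not the misrecorded denominators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    exchange: str  # hypothetical short label for the exchange
    kind: str      # hypothetical label: "substantive" or "topical"


# One entry per Comment flaw credited in the first run (6 in total).
matches = [
    Match("ceo_effect", "substantive"),
    Match("gotv", "topical"),
    Match("gotv", "topical"),
    Match("reproducibility", "substantive"),
    Match("reproducibility", "substantive"),
    Match("reproducibility", "substantive"),
]


def recall(matches, total_flaws, credit_topical):
    """Return (surfaced, total, rate); topical matches count only if credited."""
    surfaced = sum(
        1 for m in matches if credit_topical or m.kind == "substantive"
    )
    return surfaced, total_flaws, surfaced / total_flaws


first_run = recall(matches, 6, credit_topical=True)
audited = recall(matches, 6, credit_topical=False)

print(f"first run: {first_run[0]}/{first_run[1]} = {first_run[2]:.0%}")
print(f"audited:   {audited[0]}/{audited[1]} = {audited[2]:.0%}")
```

Crediting only substantive matches is the design choice that moves the headline: a topical overlap (same subject, different flaw) no longer counts as having surfaced the Comment's point.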

Read honestly: “4/4 rebutted, 0 conceded” is shaped by selecting exchanges that had a published rejoinder (authors who chose to defend), so “rebutted” is near-structural: it shows the concerns are at the live frontier, not that they are wrong. Adding exchanges where authors largely conceded (e.g. the Reinhart-Rogoff spreadsheet error) is the obvious next step. The steelman dimension (did the engine predict the Reply?) returned 3/3 missed but is exploratory and mis-specified: the engine predicted a defense against its own critique, while the actual Reply rebuts a different, specific published Comment it never saw, so that number is not a clean measure of an engine limitation.

Exchange by exchange

Strategic management / organization · recall 100% (1/1)
The use of variance decomposition in the investigation of CEO effects: How large must the CEO effect be to rule out chance?
Comment: Quigley & Graffin (Strategic Management Journal, 2017), Comment · Reply: Fitza (Strategic Management Journal, 2017), Rejoinder
- ✓ surfaced · authors rebutted: The chance 'baseline' in Fitza's simulation is mis-specified: he treats a mechanical property of the estimator (random data still producing nonzero R-squared) as evidence that real CEO performance variance is chance, conflating sampling/estimation noise with a substantive randomness-in-performance claim.
Political science (experimental methods) · recall 0% (0/2)
The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment
Comment: Imai (American Political Science Review, 2005), reanalysis · Reply: Gerber & Green (American Political Science Review, 2005), Rejoinder
- ✗ missed · authors rebutted: The headline contrast across modes (canvassing large, mail slight, phone null) is partly an artifact of imbalance/method rather than a genuine ranking of mobilization technologies, weakening the inferential basis for the contrast.
- ✗ missed · authors rebutted: The substantive thesis that turnout decline is attributable to the decline in face-to-face mobilization is overstated given that the empirical results (including the phone null) do not survive corrected analysis.
Psychology (metascience) · recall 100% (3/3)
Estimating the reproducibility of psychological science
Comment: Gilbert, King, Pettigrew & Wilson (Science, 2016), Comment · Reply: Anderson et al. / Open Science Collaboration (Science, 2016), Response
- ✓ surfaced · authors rebutted: The replication studies were statistically underpowered: many replications had low power to detect the true effect even when the original effect was real, so a substantial fraction were expected to fail by chance alone. The reported low replication rate is therefore inflated as evidence of irreproducibility.
- ✓ surfaced · authors rebutted: The replicated studies were not a representative or random sample of the literature, and protocols deviated from the originals (different populations, settings, stimuli). With non-random selection and infidelity to original methods, the project gives a biased, non-generalizable estimate of reproducibility, and protocol infidelity depresses the observed rate.
- ✓ surfaced · authors rebutted: The subjective 'endorsement'/replication-success criterion and the other success metrics are misleading and biased toward declaring non-replication; once expected agreement and the uncertainty in original and replication estimates are properly accounted for, the data are consistent with near-ceiling reproducibility rather than the low rate claimed.
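The headline figures follow directly from the per-exchange tallies listed above. The sketch below recomputes them; the dictionary layout is an assumption made for illustration and is not the schema of the machine-readable export.

```python
# Per-exchange outcomes as listed above: one boolean per Comment flaw,
# True if the engine surfaced it. (Layout is illustrative, not the real schema.)
exchanges = {
    "Strategic management / organization": [True],
    "Political science (experimental methods)": [False, False],
    "Psychology (metascience)": [True, True, True],
}

# Recall within each exchange.
per_exchange = {
    name: sum(flags) / len(flags) for name, flags in exchanges.items()
}

# Pooled recall: all flaws counted equally, regardless of exchange.
surfaced = sum(sum(flags) for flags in exchanges.values())
total = sum(len(flags) for flags in exchanges.values())
pooled_recall = surfaced / total

# Macro recall: each exchange counted equally, regardless of flaw count.
macro_recall = sum(per_exchange.values()) / len(per_exchange)

for name, value in per_exchange.items():
    print(f"{name}: {value:.0%}")
print(f"pooled: {surfaced}/{total} = {pooled_recall:.0%}")
print(f"macro:  {macro_recall:.0%}")
```

With these tallies the pooled and macro figures happen to coincide at about 67%; that is a property of this small cohort, not of the two averages in general, so it is worth stating which one a headline number refers to as more exchanges are added.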

What this proves — and what it doesn’t

It proves the engine, blind and abstract-only, can surface the substantive flaws that drove real top-journal Comments: it did so on two of the three exchanges (4 of 6 abstract-detectable flaws), and those flaws are the very points serious enough to provoke a published authorial Reply. It does not resolve who is right in each dispute (these are unresolved debates in which the authors rebutted), the cohort is small (3 exchanges), the Reply is read as its published summary (no-reproduce policy), and the steelman dimension needs a corrected design before its number means anything. Machine-readable at /critique/api/exchanges. See also the 3-exchange benchmark corpus these exchanges are drawn from.
