Correctness
Is Critical AI right — not just faithful?
The journal proves its critiques are honest — every span is grounded, zero citations are fabricated, severity is capped to access. Calibration proves they look like expert critiques. Neither proves a critique is correct. This page is the first evidence on that. Each target below is a paper that already has an authoritative published human critique — a Comment, replication, or reanalysis in a top journal. Critical AI critiqued each one reading only its title and abstract, blind to the expert verdict; the question is how many of the flaws those experts later established it independently surfaced.
Honest by construction: many expert flaws (a spreadsheet error, a hidden confound, a result that only fails on re-analysis) are not visible in an abstract at all. Scoring the engine for missing those would be dishonest. So every expert flaw is labelled abstract-detectable vs full-text/external-only, and recall is computed only over the abstract-detectable flaws. Of the 46 expert flaws across the cohort, 19 were full-text/external-only and excluded — the ceiling abstract-only access imposes.
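The labelling rule can be made concrete with a small sketch. The `ExpertFlaw` record and `recall_denominator` helper are hypothetical names used for illustration, not the engine's actual code; only the cohort totals (46 expert flaws, 19 outside abstract reach) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertFlaw:
    # Hypothetical record: one discrete flaw from a published critique.
    abstract_detectable: bool  # False means full-text/external-only


def recall_denominator(flaws):
    """Count only the flaws a reader of the abstract could have caught."""
    return sum(1 for f in flaws if f.abstract_detectable)


# Cohort totals reported on this page.
TOTAL_FLAWS = 46
OUT_OF_REACH = 19

# Build a stand-in cohort with the reported label split.
flaws = [ExpertFlaw(abstract_detectable=(i >= OUT_OF_REACH))
         for i in range(TOTAL_FLAWS)]

denominator = recall_denominator(flaws)
excluded = len(flaws) - denominator
print(denominator, excluded)
```

Under these totals, recall is scored against the 27 abstract-detectable flaws, and the 19 excluded flaws never enter the denominator.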
How the number is earned
1. Blind critique. An isolated agent critiques each target from its title + abstract only, with no sight of the expert verdict; the isolation is the blindness guarantee.
2. Ground truth. A separate agent decomposes the authoritative published critique into discrete flaws and labels each abstract-detectable vs full-text/external-only.
3. Agreement. A strict + charitable panel scores, per detectable flaw, whether a blind concern substantively matches (refute-by-default; topical overlap does not count).
4. Adversarial audit. An independent run re-judges every credited match, re-classifies detectability blind to the engine, and scans the blind critiques for leaked outside knowledge and fabricated rigour. A match counts only if it survives all of this.
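The way a single flaw ends up with one of the four labels used on this page can be sketched as a small decision function. The field names and the order of the checks (leak scan before the topical re-judgement) are assumptions made for illustration; the page states only that a match must survive every audit step.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "surfaced (audit-confirmed)"
    TOPICAL_ONLY = "topical only (audit-overturned)"
    VOIDED = "voided (used leaked knowledge)"
    MISSED = "missed"


@dataclass(frozen=True)
class FlawJudgement:
    # Hypothetical fields, one per check described in steps 3 and 4.
    panel_credited: bool         # step 3: panel credited a blind concern
    audit_substantive: bool      # step 4: re-judged as a substantive match
    uses_leaked_knowledge: bool  # step 4: leak scan flagged the concern


def final_verdict(j: FlawJudgement) -> Verdict:
    """Refute-by-default: a match counts only if it survives every check."""
    if not j.panel_credited:
        return Verdict.MISSED
    if j.uses_leaked_knowledge:  # assumed to be checked before topicality
        return Verdict.VOIDED
    if not j.audit_substantive:
        return Verdict.TOPICAL_ONLY
    return Verdict.CONFIRMED


print(final_verdict(FlawJudgement(True, True, False)).value)
```

Only `Verdict.CONFIRMED` contributes to the headline recall; the other three outcomes all count as not surfaced.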
The audit mattered. It cut the headline from a pre-audit 89% to a confirmed 63% — overturning 6 matches as merely topical and voiding 1 that leaned on knowledge a blind reader could not have. The confirmed figure is the headline precisely because it is the one that survived being attacked.
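Both percentages can be re-derived from the per-target labels listed on this page. The sketch below assumes the pre-audit figure counted every match the panel credited, that is, confirmed plus overturned plus voided; the short target keys are abbreviations introduced here.

```python
# (confirmed, overturned, voided, missed) per target, read off this page.
tallies = {
    "life outcomes":      (2, 1, 0, 1),
    "colonial origins":   (2, 0, 0, 0),
    "reproducibility":    (4, 1, 0, 0),
    "power posing":       (3, 0, 0, 1),
    "abortion and crime": (2, 1, 0, 0),
    "facial feedback":    (2, 1, 0, 0),
    "growth and debt":    (1, 1, 0, 0),
    "methylphenidate":    (1, 1, 1, 1),
}

confirmed = sum(t[0] for t in tallies.values())
overturned = sum(t[1] for t in tallies.values())
voided = sum(t[2] for t in tallies.values())
detectable = sum(sum(t) for t in tallies.values())

# Assumption: pre-audit credit = every match the panel initially accepted.
pre_audit_pct = round(100 * (confirmed + overturned + voided) / detectable)
confirmed_pct = round(100 * confirmed / detectable)
print(pre_audit_pct, confirmed_pct)
```

On this reading the audit moved the count from 24 of 27 credited matches to 17 of 27 confirmed ones.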
Target by target
Each abstract-detectable flaw the authoritative critique rests on, and whether Critical AI’s blind critique surfaced it. Full-text/external-only flaws are listed muted — they are outside abstract reach and excluded from recall.
- AI-direct · Computational social science · 50% (2/4)
Measuring the predictability of life outcomes with a scientific mass collaboration
vs Garip (2020), PNAS invited commentary, 'What failure to predict life outcomes can teach us'
- ✓ surfaced (audit-confirmed): The headline 'practical limits to the predictability of life outcomes' generalizes far beyond what was actually shown, namely that one particular pipeline (a fixed set of FFCWS predictors plus ML applied to six specific outcomes) failed to predict well; the abstract itself hedges with 'in some settings,' exposing that the limit is conditional on this design rather than a general property of life trajectories.
engine, blind: “The headline framing 'How predictable are life trajectories?' is broad, but the evidence comes from a single cohort (Fragile Families) and exactly six outcomes, so the abstract's own hedge ('in some settings') sits in tension with the sweeping opening question.”
- ✓ surfaced (audit-confirmed): The benchmark comparison is the load-bearing claim ('only slightly better than a simple benchmark model'), but how weak or strong that benchmark is, and how 'accuracy' was scored (e.g., R^2 / hold-out MSE on rare or noisy outcomes), determines whether the conclusion of poor predictability is warranted; a near-baseline result can reflect a low ceiling set by outcome noise rather than model inadequacy.
engine, blind: “The central claim rests on a comparison to 'a simple benchmark model,' but the abstract never specifies what that benchmark is, so the reader cannot judge whether 'only slightly better' reflects genuine predictive limits or an unusually strong/weak baseline.”
- ✗ missed: The constructive thesis Garip foregrounds (that the value lies in the common-task framework and out-of-sample testing rather than any model's accuracy) is itself signaled by the abstract, which already pivots to 'illustrate the value of mass collaborations,' so the commentary's positive reframing builds on what the abstract concedes.
- ✗ topical only (audit-overturned): Concluding that 160 teams converging on near-benchmark accuracy demonstrates a genuine predictability ceiling assumes the teams collectively exhausted the useful modeling and feature-engineering space; if all teams worked from the same provided feature set under the same task constraints, their convergence reflects a shared input ceiling, not proof that the outcomes are intrinsically unpredictable.
Outside abstract reach (2): Failure to predict is attributed to limits of ML/data, but the result is equally consistent with limits of the specific measured predictors and the chosen outcomes (data-quality / feature-set ceiling, missingness, attrition in the cohort) rather than an intrinsic ceiling on predictability. · The finding that prediction error is 'strongly associated with the family being predicted and weakly with the technique' is presented as evidence about predictability limits, but it could instead reflect heterogeneous/idiosyncratic measurement error or a small number of hard-to-predict cases driving variance — an interpretation that requires examining the error distribution across families.
- Economics · 100% (2/2)
The Colonial Origins of Comparative Development: An Empirical Investigation
vs Albouy (2012), American Economic Review, Comment
- ✓ surfaced (audit-confirmed): The identification rests entirely on a single instrument (settler mortality) whose validity and measurement are assumed rather than independently verified, creating a single point of failure for all downstream claims.
engine, blind: “The exclusion restriction for the instrument is asserted but not defended: European settler mortality is assumed to affect modern income only through institutions, yet historical mortality could plausibly proxy for present-day disease environment, geography, or human-capital channels that bear directly on income.”
- ✓ surfaced (audit-confirmed): The abstract makes strong, far-reaching causal claims (large effects of institutions; geography/Africa dummies become insignificant once institutions are controlled) from a cross-country observational design, which is vulnerable to fragile identification.
engine, blind: “The conclusion that Africa and equatorial countries lack lower incomes 'once institutions are controlled for' risks overclaiming, since it treats geography purely as a confounder to be partialled out while geography may operate partly through the same channel as the instrument and the mediator.”
Outside abstract reach (4): 36 of the 64 settler-mortality observations are not actual mortality rates measured in the country itself, but rates imputed/borrowed from other countries, making the key instrument's data partly fabricated by assignment. · The mortality series combines incomparable populations (laborers, bishops, soldiers on campaign) whose death rates are not on the same scale, and the combinations were made in directions favorable to the institutions hypothesis. · Once the data problems are corrected, the first-stage mortality-expropriation relationship is no longer robust, undermining instrument relevance. · After fixing the data, the IV estimates lose robustness, often producing effectively infinite confidence intervals — i.e. the headline 'large effects of institutions' is not statistically identified.
- Psychology (metascience) · 80% (4/5)
Estimating the reproducibility of psychological science
vs Gilbert, King, Pettigrew & Wilson (2016), Science, 'Comment on Estimating the reproducibility of psychological science'
- ✓ surfaced (audit-confirmed): Many replications were themselves underpowered, so a non-significant replication is expected by sampling error even when the original effect is real; the headline 'low replication rate' conflates true failure to replicate with the replications' own statistical limitations.
engine, blind: “Using statistical significance of the replication as a reproducibility criterion conflates effect existence with the dichotomous p<.05 outcome, and the abstract reports significance counts (97% vs 36%) without acknowledging that low replication power or smaller true effects, rather than non-reproducibility, could drive the gap.”
- ✓ surfaced (audit-confirmed): The 100 studies were not a representative random sample of the psychology literature; studies were selected under feasibility and assignment constraints, so the aggregate 'reproducibility' estimate cannot be generalized to the field as the abstract implies.
engine, blind: “The abstract does not state how the 100 studies were selected from the candidate pool in the three journals, so whether the estimate is representative or subject to selection effects cannot be assessed from the abstract alone.”
- ✗ topical only (audit-overturned): The 'subjective replication' endorsement criterion (and the family of dichotomous success metrics) is misleading: subjective ratings and significance-counting are noisy, low-reliability gauges that understate agreement between original and replication; better metrics (e.g., whether the original effect falls in the replication CI) suggest much higher reproducibility.
- ✓ surfaced (audit-confirmed): Comparing the magnitude of replication effects to original effects to infer 'a substantial decline' ignores regression to the mean and selection/publication bias inflating originals, so the halving of effect sizes does not by itself index irreproducibility.
engine, blind: “The claim that replication effects represent a 'substantial decline' interprets a halving of effect magnitude as decline, but regression to the mean and publication-driven inflation of originals are plausible alternative explanations the abstract does not address.”
- ✓ surfaced (audit-confirmed): Replications used 'original materials when available' (i.e., not always), and protocol/infidelity differences (population, setting, procedure) plausibly account for non-replications rather than the originals being false positives.
engine, blind: “'Original materials when available' implies some replications used non-original materials, a heterogeneity that could systematically depress observed replication and is not quantified in the abstract.”
Outside abstract reach (2): Gilbert et al.'s reanalysis using a high-fidelity benchmark (independent direct-replication confidence intervals) shows the observed replication rate is statistically consistent with ~100% true reproducibility once expected sampling error and infidelity are accounted for. · Endorsement/CI-overlap metrics were not adjusted for the error rate of the replication estimates themselves, so the count of 'replications whose CI excludes the original effect' overstates failures because some exclusions arise purely from replication noise.
- Social psychology · 75% (3/4)
Power Posing
vs Ranehill et al. (2015), Psychological Science (large replication)
- ✓ surfaced (audit-confirmed): The original study was underpowered: its sample was too small to reliably estimate the claimed hormonal and behavioral effects, making them vulnerable to false positives.
engine, blind: “The abstract reports a single study yet draws broad causal physiological and behavioral conclusions, with no sample size disclosed, leaving statistical power and reliability unverifiable.”
- ✗ missed: Only the self-reported 'feelings of power' effect replicates; the subjective self-report outcome is the weakest, most demand-susceptible measure and cannot license the physiological/behavioral conclusions.
- ✓ surfaced (audit-confirmed): The abstract over-generalizes to 'real-world, actionable implications' from a single laboratory study, asserting instant, universal effects not warranted by the evidence base.
engine, blind: “The closing sentence extrapolates from a lab manipulation to sweeping real-world prescriptions, treating a 1-minute pose as something that makes a person 'instantly become more powerful' — a generalization the within-study evidence described cannot support.”
- ✓ surfaced (audit-confirmed): The strong causal claim ('can these postures actually cause power?... posing... caused... changes') rests on a single original study without independent confirmation, a contested-phenomenon risk.
engine, blind: “No information is given on replication, data, or code availability, so the strong actionable claims rest on a single unreplicated study as described in the abstract.”
Outside abstract reach (2): The original study's neuroendocrine and behavioral effects (testosterone increase, cortisol decrease, risk tolerance) fail to replicate in a larger, better-powered sample (N=200), so the physiological and behavioral causal claims are not robust. · Cortisol and testosterone responses are noisy, assay- and timing-sensitive measures whose claimed clean directional effects may reflect uncontrolled measurement/analytic variability rather than a true posture effect.
- Economics · 67% (2/3)
The Impact of Legalized Abortion on Crime
vs Foote & Goetz (2008), Quarterly Journal of Economics, Comment
- ✗ topical only (audit-overturned): The cross-state test (high-abortion states in the 1970s-80s showing greater 1990s crime declines) is not robust to allowing differential state-specific time trends; once such trends are permitted, the abortion-crime relationship is not reliably identified.
- ✓ surfaced (audit-confirmed): The attribution of 'as much as 50 percent' of the crime drop to legalized abortion is an overstated magnitude claim that collapses once the coding error is fixed and per-capita / trend-robust specifications are used.
engine, blind: “The 50 percent figure is presented as an upper bound ('as much as') yet functions as the abstract's takeaway, an overclaiming pattern where the most striking number is foregrounded without the accompanying lower bound, confidence interval, or uncertainty.”
- ✓ surfaced (audit-confirmed): The overall identification leans on observational timing coincidences (crime falling 'roughly eighteen years after' legalization; five early-legalizing states declining earlier) that are confounded by many other contemporaneous determinants of 1990s crime trends.
engine, blind: “The five-state early-legalization comparison rests on a very small treated group whose crime trends could diverge for many reasons unrelated to abortion; the abstract reports the directional difference without any control for the states' other characteristics.”
Outside abstract reach (2): The headline within-state cohort result ('only arrests of those born after abortion legalization fall relative to low abortion states') rests on a regression that was intended to include state-year interaction terms but, due to a coding mistake, omitted them; correcting this guts the effect. · The within-state results depend on using arrest counts rather than a per-capita (arrests-divided-by-population) specification; moving to per-capita crime rates sharply weakens the estimated abortion effect because cohort size itself drives arrest counts.
- Social psychology · 67% (2/3)
Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis.
vs Wagenmakers et al. (2016), Registered Replication Report, Perspectives on Psychological Science
- ✗ topical only (audit-overturned): The original two-study finding rests on a small single-source sample, making the reported effect vulnerable to sampling variability and false-positive inflation typical of pre-power-revolution social psychology.
- ✓ surfaced (audit-confirmed): The pen-in-mouth manipulation is a subtle, demand-prone, and fragile procedure whose effect on self-reported humor may not survive procedural variation (e.g., presence/absence of cameras, instructions, stimuli) across labs.
engine, blind: “The sole outcome appears to be a subjective self-report ('reported more intense humor responses'), so the affective effect is measured by demand-sensitive ratings rather than any independent behavioral or physiological index of emotion.”
- ✓ surfaced (audit-confirmed): The abstract over-interprets the data by drawing fine-grained theoretical conclusions (facial feedback affects the affective but not the cognitive component; both inhibitory and facilitatory mechanisms contribute) that depend on an effect that itself does not reliably exist.
engine, blind: “The dissociation claim that 'facial feedback operates on the affective but not on the cognitive component' is a claim of a null effect on cognition, which the abstract presents as a positive finding without indicating it was powered to detect such an absence.”
Outside abstract reach (2): The headline effect (induced smiling increases rated funniness) does not replicate: a pooled estimate across 17 high-powered direct replications is 0.03 rating units (95% CI -0.11 to 0.16), versus the original 0.82, indicating the original effect is essentially absent or massively overstated. · The reported original effect size (0.82) is implausibly large relative to what the construct can plausibly produce, consistent with publication-era effect-size inflation rather than a stable phenomenon.
- Economics · 50% (1/2)
Growth in a Time of Debt
vs Herndon, Ash & Pollin (2014), Cambridge Journal of Economics
- ✓ surfaced (audit-confirmed): The headline causal/policy reading, that crossing a 90% debt/GDP threshold sharply depresses growth, rests on a discontinuity (median growth falls 1%, average falls much more above 90%) that the abstract itself presents as a sharp threshold despite an associational, observational design.
engine, blind: “The abstract describes only a correlational threshold relationship yet the framing 'Growth in a Time of Debt' and the structure of the claims invite a causal reading (debt depresses growth) that the stated descriptive methods cannot support.”
- ✗ topical only (audit-overturned): Pooling forty-four heterogeneous countries across ~200 years with wildly unequal numbers of observations per country invites sensitivity to how unequal-length panels are averaged.
Outside abstract reach (4): A spreadsheet (Excel) coding error caused several countries (e.g., Australia, Austria, Belgium, Canada, Denmark) to be silently omitted from the high-debt average, mechanically lowering the reported growth rate for the 90%+ debt category. · Selective exclusion of available country-year data (e.g., New Zealand 1946-49, early postwar Canada/Australia) dropped high-growth, high-debt episodes that would have offset the low-growth ones, biasing the high-debt average downward. · An unconventional weighting scheme (weighting each country equally rather than by the number of country-year observations, so a single year for one country counts as much as decades for another) distorted the high-debt growth estimate. · Correcting all three problems raises average real growth for the high-debt (>90%) group to +2.2% from the published -0.1%, so the headline 90% threshold / cliff in growth does not exist.
- Psychology / neuroscience · 25% (1/4)
Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers
vs Hagger et al. (2016), Registered Replication Report, Perspectives on Psychological Science
- ✗ voided (used leaked knowledge): The paper treats regulatory depletion as a settled, well-replicated phenomenon ('more than 100 studies over the last decade') and builds its entire claim on that premise, rather than treating it as a contested effect requiring independent verification.
- ✗ missed: If the baseline depletion effect is itself absent/near-zero, there is no genuine depletion for methylphenidate to 'block,' so the headline causal claim that the drug reverses a real regulatory-control deficit is unsupported.
- ✗ topical only (audit-overturned): The study is a single-lab demonstration whose central effect (a drug-by-depletion interaction) is reported with strong, definitive language ('fully blocks', 'demonstrated specificity') without independent replication.
- ✓ surfaced (audit-confirmed): The secondary mechanistic finding (spectral analysis localizing methylphenidate effects to the slow-4 band, linked to mind-wandering brain networks) is an exploratory, post-hoc neurophysiological interpretation layered on top of the unverified depletion effect.
engine, blind: “The leap from a behavioral RT spectral signature to claims about resting-state brain networks and mind wandering is a reverse-inference: an association of a band with networks elsewhere does not establish that those networks drove this effect, and no neuroimaging is described.”
Outside abstract reach (1): The foundational ego-depletion (sequential-task regulatory depletion) effect that the paper presumes is not robust: a 23-lab preregistered replication (~2000 participants) found a meta-analytic effect indistinguishable from zero (d ~ 0.04).
Self-improvement, gated on held-out evidence
The benchmark doesn’t just measure; it diagnoses where the engine falls short, sorting the gaps into 6 reusable failure modes that feed the next critique’s self-check. But an improvement you assume works is not an improvement. So a candidate lesson-set activates only if it beats a no-lessons baseline on a held-out A/B: fresh papers it was never derived from, scored within one run (the only valid comparison, since the flaw decomposition is re-run each time).
v1, "be more specific" (sharpen every concern to a specific mechanism): FAILED → not activated · lessons 60% vs baseline 80%
REGRESSED (delta -0.20): pruned correct-simple concerns and over-reached into specific-but-wrong ones. Failed the gate; not activated.
v2, additive (sharpen only where the abstract licenses it; never drop a valid concern or invent a false specific one): PASSED → activated · lessons 73% vs baseline 64%
PASSED (delta +0.09): zero per-target regression, zero pruning; +1 real substantive catch (Brady length confound). Activated, with the margin (one flaw) disclosed.
v1 (“be more specific”) regressed by pruning correct-but-simple concerns; the v2 additive meta-rule (sharpen only where the abstract licenses it, never drop a valid concern) removed that harm and passed by a one-flaw margin with zero pruning: a safe, modest, honestly bounded improvement, not a large effect. The loop is falsifiable, and it caught v1’s regression before it could degrade a published critique.
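The activation rule can be written as a one-line comparison. This is a minimal sketch with a hypothetical `gate` function; it assumes that "beats" means strictly higher held-out recall within the same run, so a tie does not activate. The rates are the ones reported above for v1 and v2.

```python
def gate(lessons_rate: float, baseline_rate: float) -> tuple[bool, float]:
    """Return (activated, delta) for one held-out A/B run.

    Assumption: a lesson-set activates only when its recall is strictly
    higher than the no-lessons baseline scored in the same run.
    """
    delta = round(lessons_rate - baseline_rate, 2)
    return delta > 0, delta


# Rates reported on this page for the two candidate lesson-sets.
v1 = gate(0.60, 0.80)  # "be more specific"
v2 = gate(0.73, 0.64)  # additive meta-rule
print(v1, v2)
```

A gate this simple does not by itself check for per-target regression or pruning; those were reported alongside the v2 result rather than encoded in this sketch.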
The same gate, applied to the faithfulness loop — a humbling result.
The journal’s in-sample faithfulness claim (the self-check lifted faithful-rate 63%→100%) was held to the same held-out A/B. On 4 fresh papers, the 7 faithfulness lessons reduced source-strengthening from 3 to 2 — held-out safe (never increased strengthening), but marginal, not the dramatic in-sample lift: the baseline blind prompt is already near-faithful (2 of 4 papers had zero in both arms). The in-sample headline does not replicate as a large held-out effect — stated, not hidden.
What this proves — and what it doesn’t
It proves that, reading only abstracts and blind to the expert verdict, Critical AI independently surfaced 63% of the abstract-detectable flaws that authoritative published human critiques later established, and did so with 0 instances of fabricated rigour, as judged by the strict panel, across 67 blind concerns (the G54 audit noted the charitable panel flagged one, on power-posing; disclosed, not adjudicated as material). That is evidence the engine’s judgement is often correct, not merely well-formed.
It does not prove absolute correctness. Agreement with one authoritative critique is a strong proxy for being right, not a guarantee. The cohort is small (8 targets). Recall is bounded by abstract-only access — 19 of 46 expert flaws were full-text/external-only and could not be reached. And the audit found 1 match that looked right but leaned on outside knowledge, which was removed. The figure is deliberately the one that survived an adversarial attempt to lower it. Machine-readable at /critique/api/correctness.