{"$schema":"https://policywindow.org/critique/api/schema","name":"Critical AI — correctness-agreement benchmark","description":"Whether Critical AI's critiques are correct, not merely faithful. Each target is a paper with an authoritative published human critique (Comment/replication/reanalysis). Reading only title + abstract and blind to the expert verdict, the engine's critique is scored on how many abstract-detectable expert flaws it independently surfaced; an adversarial audit then re-attacks the scoring. Headline = confirmed recall (survived the audit, leakage-free). Recall is computed only over abstract-detectable flaws; full-text/external-only flaws are excluded as the abstract-only ceiling.","docs":"https://policywindow.org/critique/correctness","run_date":"2026-06-20","headline":{"confirmed_recall":0.63,"confirmed_matched":17,"detectable_flaws":27,"pre_audit_strict_recall":0.889,"fulltext_or_external_flaws_excluded":19,"total_expert_flaws":46,"genuine_overclaims":0,"total_blind_concerns":67,"audit":{"overturns_confirmed":6,"leakage_voids":1,"rejected_false_alarm_flags":2}},"targets":[{"slug":"salganik-fragile-families","targetTitle":"Measuring the predictability of life outcomes with a scientific mass collaboration","targetDoi":"10.1073/pnas.1915006117","aiRelated":true,"field":"Computational social science","blindConcernCount":6,"detectableFlaws":4,"fulltextOnlyFlaws":2,"strictMatched":3,"confirmedMatched":2,"confirmedRecall":0.5,"strictRecall":0.75,"overturned":1,"voided":0},{"slug":"colonial-origins","targetTitle":"The Colonial Origins of Comparative Development: An Empirical Investigation","targetDoi":"10.1257/aer.91.5.1369","aiRelated":false,"field":"Economics","blindConcernCount":8,"detectableFlaws":2,"fulltextOnlyFlaws":4,"strictMatched":2,"confirmedMatched":2,"confirmedRecall":1,"strictRecall":1,"overturned":0,"voided":0},{"slug":"osc-reproducibility","targetTitle":"Estimating the reproducibility of psychological 
science","targetDoi":"10.1126/science.aac4716","aiRelated":false,"field":"Psychology (metascience)","blindConcernCount":9,"detectableFlaws":5,"fulltextOnlyFlaws":2,"strictMatched":5,"confirmedMatched":4,"confirmedRecall":0.8,"strictRecall":1,"overturned":1,"voided":0},{"slug":"power-posing-original","targetTitle":"Power Posing","targetDoi":"10.1177/0956797610383437","aiRelated":false,"field":"Social psychology","blindConcernCount":8,"detectableFlaws":4,"fulltextOnlyFlaws":2,"strictMatched":3,"confirmedMatched":3,"confirmedRecall":0.75,"strictRecall":0.75,"overturned":0,"voided":0},{"slug":"abortion-crime","targetTitle":"The Impact of Legalized Abortion on Crime","targetDoi":"10.1162/00335530151144050","aiRelated":false,"field":"Economics","blindConcernCount":9,"detectableFlaws":3,"fulltextOnlyFlaws":2,"strictMatched":3,"confirmedMatched":2,"confirmedRecall":0.667,"strictRecall":1,"overturned":1,"voided":0},{"slug":"facial-feedback-original","targetTitle":"Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis.","targetDoi":"10.1037/0022-3514.54.5.768","aiRelated":false,"field":"Social psychology","blindConcernCount":8,"detectableFlaws":3,"fulltextOnlyFlaws":2,"strictMatched":3,"confirmedMatched":2,"confirmedRecall":0.667,"strictRecall":1,"overturned":1,"voided":0},{"slug":"reinhart-rogoff-growth-debt","targetTitle":"Growth in a Time of Debt","targetDoi":"10.1257/aer.100.2.573","aiRelated":false,"field":"Economics","blindConcernCount":10,"detectableFlaws":2,"fulltextOnlyFlaws":4,"strictMatched":2,"confirmedMatched":1,"confirmedRecall":0.5,"strictRecall":1,"overturned":1,"voided":0},{"slug":"ego-depletion-original","targetTitle":"Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers","targetDoi":"10.1177/0956797614526415","aiRelated":false,"field":"Psychology / 
neuroscience","blindConcernCount":9,"detectableFlaws":4,"fulltextOnlyFlaws":1,"strictMatched":3,"confirmedMatched":1,"confirmedRecall":0.25,"strictRecall":0.75,"overturned":1,"voided":1}],"self_improvement":{"correctness_lessons":[{"id":"generic-alternative-not-specific-mechanism","pattern":"Generic alternative-explanations gesture: raising 'the result doesn't rule out confounds / data limitations / other explanations' as a list, when the load-bearing flaw is ONE specific mechanism the design invites.","example":"Saying a low-accuracy prediction result 'doesn't rule out data limitations, measurement error, or feature coverage' (a generic list) — when the specific point is that many teams sharing ONE provided feature set demonstrate a shared INPUT ceiling, not intrinsic unpredictability.","guidance":"If you raise an alternative-explanation or confounding concern, name the SPECIFIC channel/mechanism the design actually invites and why, in one sentence. A generic 'doesn't rule out X, Y, Z' list does not engage the substantive flaw. Ask: what is the ONE thing about THIS design that makes the headline fragile?","foundIn":["salganik-fragile-families"],"addedDate":"2026-06-21"},{"id":"wrong-statistical-problem","pattern":"Wrong statistical problem: flagging a related-but-different statistical threat than the one the design actually invites.","example":"Flagging serial correlation / non-independence / 'treating observations as exchangeable' when the design's actual sensitivity is to how an UNBALANCED panel (wildly unequal observations per unit) is AVERAGED/WEIGHTED — a different problem.","guidance":"Pin the specific statistical threat the stated design creates (unbalanced-panel weighting vs autocorrelation vs multiple comparisons vs measurement error are NOT interchangeable). 
Do not substitute a generic or adjacent statistical worry for the one that bites.","foundIn":["reinhart-rogoff-growth-debt"],"addedDate":"2026-06-21"},{"id":"state-the-direction-of-bias","pattern":"Direction-free criterion-dependence: noting a headline is 'criterion-dependent' or that multiple metrics disagree, without stating WHICH DIRECTION the reported choice biases the conclusion.","example":"Observing that several reproducibility rates are reported and the takeaway is 'underdetermined' — when the substantive point is directional: the conservative dichotomous metrics UNDERSTATE agreement, so the pessimistic headline is biased downward.","guidance":"When a headline rests on a criterion/specification/measurement choice, state the DIRECTION (does the chosen criterion over- or under-state the effect?) and which alternative would move it. 'It varies' is weaker than 'this choice biases the result downward'.","foundIn":["osc-reproducibility"],"addedDate":"2026-06-21"},{"id":"replication-not-just-power","pattern":"Replication risk reduced to reporting/power: for a small single-study or single-lab finding, complaining only that 'no sample/effect size is reported' instead of naming the replicability / false-positive fragility.","example":"Saying a definitive single-lab result 'reports no sample size so power can't be judged' — when the substantive flaw is that a surprising single-lab effect with definitive language ('fully blocks', 'caused') is vulnerable to non-replication / false-positive inflation and needs independent replication.","guidance":"For a single-study/single-lab finding with strong or surprising claims, name the REPLICATION risk explicitly (false-positive fragility; needs independent replication), not merely undisclosed statistics. 
Undisclosed power is a reporting gap; non-replicability is the substantive risk.","foundIn":["ego-depletion-original","facial-feedback-original"],"addedDate":"2026-06-21"},{"id":"observational-to-causal-name-the-threat","pattern":"Generic 'correlation ≠ causation' for a causal/IV claim, instead of naming the SPECIFIC identification threat.","example":"Saying a cross-state or cross-country causal claim is 'vulnerable to omitted-variable bias' generically — when the specific threat is, e.g., non-robustness to differential unit-specific trends, or an instrument whose exclusion restriction fails through a named channel.","guidance":"For a causal/IV/quasi-experimental claim from observational data, name the SPECIFIC identification threat: the exact confounder/channel, the failing exclusion restriction, or the specification (e.g. differential trends) the result is not robust to. A bare 'correlation isn't causation' does not engage the design.","foundIn":["colonial-origins","abortion-crime"],"addedDate":"2026-06-21"},{"id":"ground-only-in-provided-text","pattern":"Outside-knowledge leakage: asserting a paper's premise is 'contested' or 'failed to replicate' using knowledge NOT in the provided abstract/title — which a blind, abstract-only critique cannot legitimately invoke.","example":"Critiquing a study built on a large prior literature by asserting that literature's 'replicability has been widely contested' — importing replication-crisis knowledge the abstract never states.","guidance":"Ground every concern ONLY in the provided text. You may note that a paper TREATS a premise as settled and rests heavily on it (abstract-visible); you may NOT assert the premise is false or contested unless the text says so. 
Do not import outside findings, replication outcomes, or literatures the abstract does not mention.","foundIn":["ego-depletion-original"],"addedDate":"2026-06-21"}],"correctness_held_out_runs":[{"version":"v1","runDate":"2026-06-21","framing":"\"be more specific\" — sharpen every concern to a specific mechanism","heldOutTargets":4,"perTarget":[{"slug":"duckworth-grit","aiRelated":false,"field":"Psychology / education","detectable":4,"baselineMatched":4,"lessonsMatched":2,"baselineConcernCount":6,"lessonsConcernCount":6},{"slug":"chetty-teachers-va","aiRelated":false,"field":"Economics (labour/education)","detectable":2,"baselineMatched":2,"lessonsMatched":2,"baselineConcernCount":7,"lessonsConcernCount":5},{"slug":"brady-moral-contagion","aiRelated":true,"field":"Computational social science","detectable":3,"baselineMatched":1,"lessonsMatched":1,"baselineConcernCount":7,"lessonsConcernCount":6},{"slug":"breznau-welfare-responsiveness","aiRelated":false,"field":"Political sociology","detectable":1,"baselineMatched":1,"lessonsMatched":1,"baselineConcernCount":8,"lessonsConcernCount":6}],"verdict":"REGRESSED (delta -0.20): pruned correct-simple concerns and over-reached into specific-but-wrong ones. 
Failed the gate; not activated."},{"version":"v2","runDate":"2026-06-21","framing":"additive — sharpen only where the abstract licenses it; never drop a valid concern or invent a false specific one","heldOutTargets":4,"perTarget":[{"slug":"duckworth-grit","aiRelated":false,"field":"Psychology / education","detectable":3,"baselineMatched":2,"lessonsMatched":2,"baselineConcernCount":7,"lessonsConcernCount":7},{"slug":"chetty-teachers-va","aiRelated":false,"field":"Economics (labour/education)","detectable":2,"baselineMatched":1,"lessonsMatched":1,"baselineConcernCount":7,"lessonsConcernCount":6},{"slug":"brady-moral-contagion","aiRelated":true,"field":"Computational social science","detectable":4,"baselineMatched":3,"lessonsMatched":4,"baselineConcernCount":6,"lessonsConcernCount":6},{"slug":"breznau-welfare-responsiveness","aiRelated":false,"field":"Political sociology","detectable":2,"baselineMatched":1,"lessonsMatched":1,"baselineConcernCount":6,"lessonsConcernCount":7}],"verdict":"PASSED (delta +0.09): zero per-target regression, zero pruning; +1 real substantive catch (Brady length confound). 
Activated, with the margin (one flaw) disclosed."}],"correctness_lessons_active":true,"faithfulness_held_out":{"runDate":"2026-06-21","heldOutTargets":4,"lessonCount":7,"baselineStrengthenings":3,"lessonsStrengthenings":2,"baselineGenuine":1,"lessonsGenuine":0,"baselineConcerns":27,"lessonsConcerns":26,"perTarget":[{"slug":"duckworth-grit","baselineStrengthenings":0,"lessonsStrengthenings":0,"baselineConcernCount":6,"lessonsConcernCount":6,"judgeTitleArtifacts":0},{"slug":"chetty-teachers-va","baselineStrengthenings":0,"lessonsStrengthenings":0,"baselineConcernCount":6,"lessonsConcernCount":7,"judgeTitleArtifacts":0},{"slug":"brady-moral-contagion","baselineStrengthenings":2,"lessonsStrengthenings":2,"baselineConcernCount":7,"lessonsConcernCount":6,"judgeTitleArtifacts":2},{"slug":"breznau-welfare-responsiveness","baselineStrengthenings":1,"lessonsStrengthenings":0,"baselineConcernCount":8,"lessonsConcernCount":7,"judgeTitleArtifacts":0}],"verdict":"Held-out SAFE + marginally positive: faithfulness lessons never increased strengthening and eliminated one (title-corrected 1 -> 0). The dramatic in-sample 63%->100% does NOT replicate as a large held-out effect — the baseline blind prompt is already near-faithful (2 of 4 papers zero in both arms)."},"faithfulness_held_out_safe":true,"note":"A candidate lesson-set activates in generation ONLY if it beats a no-lessons baseline on a held-out A/B (within-run comparison; the decomposition is re-run each time). Correctness v1 regressed and was rejected; the additive v2 passed (one-flaw margin, no pruning) and is active. 
The faithfulness lessons are held-out safe but marginal — the in-sample 63%->100% does not replicate as a large effect."},"records":[{"slug":"salganik-fragile-families","targetTitle":"Measuring the predictability of life outcomes with a scientific mass collaboration","targetDoi":"10.1073/pnas.1915006117","openalexId":"https://openalex.org/W3013182397","aiRelated":true,"field":"Computational social science","expertSource":"Garip (2020), PNAS invited commentary, 'What failure to predict life outcomes can teach us'","expertCritique":"An invited PNAS commentary arguing that the mass-collaboration finding that machine-learning models barely beat a simple benchmark exposes real limits of predictive ML in social science, and that the value lies in the common-task framework and out-of-sample testing rather than in any individual model's accuracy. It reframes the celebrated ML exercise as evidence of how little predictive purchase rich data plus ML actually buys for individual life outcomes.","blindConcerns":[{"id":"c1","summary":"The headline framing 'How predictable are life trajectories?' is broad, but the evidence comes from a single cohort (Fragile Families) and exactly six outcomes, so the abstract's own hedge ('in some settings') sits in tension with the sweeping opening question.","dimension":"generalisation","abstractBasis":"'How predictable are life trajectories?' ... 'data from the Fragile Families and Child Wellbeing Study' ... 'practical limits to the predictability of life outcomes in some settings'","confidence":"high"},{"id":"c2","summary":"The central claim rests on a comparison to 'a simple benchmark model,' but the abstract never specifies what that benchmark is, so the reader cannot judge whether 'only slightly better' reflects genuine predictive limits or an unusually strong/weak baseline.","dimension":"methods","abstractBasis":"'the best predictions ... 
were only slightly better than those from a simple benchmark model'","confidence":"high"},{"id":"c3","summary":"Accuracy is characterized only qualitatively ('not very accurate', 'only slightly better', 'slightly'), with no effect sizes, error metrics, or confidence intervals reported in the abstract, making the strength of the conclusion impossible to assess from the abstract alone.","dimension":"statistics","abstractBasis":"'the best predictions were not very accurate and were only slightly better' ... 'weakly associated with the technique used'","confidence":"high"},{"id":"c4","summary":"Low predictive accuracy in this dataset is interpreted as 'practical limits to the predictability of life outcomes,' but the abstract does not rule out alternative explanations such as data limitations, measurement error in the outcomes, or feature coverage, conflating a ceiling on these models with a ceiling on predictability itself.","dimension":"overclaiming","abstractBasis":"'these results suggest practical limits to the predictability of life outcomes' following 'the best predictions were not very accurate'","confidence":"medium"},{"id":"c5","summary":"The finding that error is 'strongly associated with the family being predicted and weakly associated with the technique' is presented as a substantive result, but the abstract gives no quantification of these associations, leaving the relative magnitudes asserted rather than shown.","dimension":"claims","abstractBasis":"'prediction error was strongly associated with the family being predicted and weakly associated with the technique used to generate the prediction'","confidence":"medium"},{"id":"c6","summary":"Using 160 teams via the common task method is a credible design, but the abstract reports only the 'best predictions' and aggregate associations, not how outcomes were scored, held out, or guarded against overfitting to a shared test set, so reproducibility of the central comparison cannot be evaluated from the 
abstract.","dimension":"reproducibility","abstractBasis":"'160 teams built predictive models for six life outcomes ... using the common task method'","confidence":"low"}],"expertFlaws":[{"id":"f1","claim":"The headline 'practical limits to the predictability of life outcomes' generalizes far beyond what was actually shown — namely that one particular pipeline (a fixed set of FFCWS predictors plus ML applied to six specific outcomes) failed to predict well; the abstract itself hedges with 'in some settings,' exposing that the limit is conditional on this design rather than a general property of life trajectories.","detectability":"abstract_detectable","justification":"The abstract poses the grand question 'How predictable are life trajectories?' and then concedes 'in some settings,' so a careful reader can see the gap between the sweeping framing and the conditional result without reading the full paper. Garip's reframing ('how little predictive purchase rich data plus ML buys') is essentially this scope point."},{"id":"f2","claim":"Failure to predict is attributed to limits of ML/data, but the result is equally consistent with limits of the specific measured predictors and the chosen outcomes (data-quality / feature-set ceiling, missingness, attrition in the cohort) rather than an intrinsic ceiling on predictability.","detectability":"fulltext_or_external","justification":"Distinguishing 'ML can't do better' from 'these particular Fragile Families variables and their measurement/missingness can't support better prediction' requires inspecting the dataset's variables, missing-data structure, and the outcome operationalizations — not visible from the abstract."},{"id":"f3","claim":"The benchmark comparison is the load-bearing claim ('only slightly better than a simple benchmark model'), but how weak or strong that benchmark is, and how 'accuracy' was scored (e.g., R^2 / hold-out MSE on rare or noisy outcomes), determines whether the conclusion of poor predictability 
is warranted; a near-baseline result can reflect a low ceiling set by outcome noise rather than model inadequacy.","detectability":"abstract_detectable","justification":"The abstract makes the benchmark comparison the criterion for the entire conclusion, so a reader can legitimately flag that the whole inference is criterion-dependent and hinges on an unspecified benchmark and scoring choice — even if confirming the specifics needs the full text."},{"id":"f4","claim":"The constructive thesis Garip foregrounds — that the value lies in the common-task framework and out-of-sample testing rather than any model's accuracy — is itself signaled by the abstract, which already pivots to 'illustrate the value of mass collaborations,' so the commentary's positive reframing builds on what the abstract concedes.","detectability":"abstract_detectable","justification":"The abstract's final sentence explicitly relocates the contribution from prediction accuracy to the collaboration/method, which is precisely the reframing Garip endorses; a reader sees this directly."},{"id":"f5","claim":"The finding that prediction error is 'strongly associated with the family being predicted and weakly with the technique' is presented as evidence about predictability limits, but it could instead reflect heterogeneous/idiosyncratic measurement error or a small number of hard-to-predict cases driving variance — an interpretation that requires examining the error distribution across families.","detectability":"fulltext_or_external","justification":"Whether per-family error reflects an irreducible predictability floor versus heteroskedastic noise or outliers cannot be adjudicated from the abstract's one-sentence summary; it needs the underlying per-case error data."},{"id":"f6","claim":"Concluding that 160 teams converging on near-benchmark accuracy demonstrates a genuine predictability ceiling assumes the teams collectively exhausted the useful modeling and feature-engineering space; if all teams 
worked from the same provided feature set under the same task constraints, their convergence reflects a shared input ceiling, not proof that the outcomes are intrinsically unpredictable.","detectability":"abstract_detectable","justification":"The abstract states all teams used the same dataset and common task and that technique mattered little — a reader can reason that shared inputs plus a shared task naturally cap how much technique can vary, qualifying the 'limits to predictability' inference without needing the full paper."}],"strict":{"judgements":[{"flawId":"f1","matched":true,"matchedConcernId":"c1","evidence":"c1 names exactly the gap f1 identifies: the broad headline question 'How predictable are life trajectories?' versus a single cohort with six outcomes, and explicitly notes the abstract's own hedge ('in some settings') 'sits in tension with the sweeping opening question.' This is the same overgeneralization/conditional-scope problem f1 describes (sweeping framing vs. a limit conditional on this design), including the same use of the 'in some settings' hedge as evidence."},{"flawId":"f3","matched":true,"matchedConcernId":"c2","evidence":"c2 directly targets the load-bearing benchmark claim: 'the central claim rests on a comparison to a simple benchmark model, but the abstract never specifies what that benchmark is, so the reader cannot judge whether only slightly better reflects genuine predictive limits or an unusually strong/weak baseline.' This matches f3's core point that the conclusion hinges on an unspecified benchmark. c3 separately notes missing scoring metrics, but c2 captures the substantive criterion-dependence f3 raises."},{"flawId":"f4","matched":false,"matchedConcernId":"","evidence":"f4 is a constructive observation that the abstract already pivots the contribution to 'illustrate the value of mass collaborations,' endorsing Garip's positive reframing. 
No blind concern engages with this pivot to collaboration value as the relocated contribution. The concerns treat the collaboration only as a design element to critique (c6) rather than recognizing the abstract's own relocation of the contribution. No substantive match."},{"flawId":"f6","matched":true,"matchedConcernId":"c4","evidence":"f6 argues that near-benchmark convergence proves a shared input/feature ceiling rather than intrinsic unpredictability. c4 makes the same inferential point: low accuracy 'is interpreted as practical limits to the predictability of life outcomes, but the abstract does not rule out alternative explanations such as data limitations... or feature coverage, conflating a ceiling on these models with a ceiling on predictability itself.' Both identify the conflation of a model/input ceiling with a predictability ceiling. c6 raises overfitting/shared-test-set worries but does not make f6's shared-feature-ceiling argument; c4 is the substantive match."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f1","matched":true,"matchedConcernId":"c1","evidence":"c1 states the 'How predictable are life trajectories?' framing is broad while evidence comes from a single cohort and six outcomes, and explicitly flags tension with the 'in some settings' hedge. This substantively matches f1's point that the headline generalizes beyond a single FFCWS pipeline applied to six outcomes, with the abstract's own 'in some settings' hedge exposing the conditional nature of the limit. Same scope/over-generalization flaw, same evidentiary hook (the hedge)."},{"flawId":"f3","matched":true,"matchedConcernId":"c2","evidence":"c2 says the central claim rests on an unspecified 'simple benchmark model' so the reader cannot judge whether 'only slightly better' reflects genuine limits or a weak/strong baseline. This is the load-bearing criterion-dependence point in f3. 
c3 separately raises lack of error metrics/scoring quantification, which is the other half of f3, but c2 most directly captures the benchmark-dependence flaw. The scoring/'low ceiling from outcome noise' nuance in f3 is only partially covered, but the core flaw — the conclusion hinges on an unspecified benchmark — is substantively identified."},{"flawId":"f4","matched":false,"matchedConcernId":"","evidence":"f4 is a constructive/positive reframing: the abstract itself pivots the contribution to 'the value of mass collaborations,' which Garip endorses. No blind concern recognizes this pivot as constructive or builds on it; the concerns are all critical (c1-c6 all assert weaknesses). c1 mentions the framing tension and c6 touches the common-task design as 'credible,' but none identify that the abstract relocates the contribution from accuracy to the collaboration/method as a virtue. No substantive match."},{"flawId":"f6","matched":true,"matchedConcernId":"c6","evidence":"c6 notes all teams used the same dataset/common task and that the abstract does not show how outcomes were held out or guarded against overfitting to a shared test set. This overlaps with f6's point that shared inputs/task cap how much technique can vary. However c6 frames it as a reproducibility/overfitting gap rather than f6's specific 'shared feature set means convergence reflects an input ceiling, not intrinsic unpredictability' inference. c4 is closer on the underlying logic (low accuracy interpreted as predictability limits without ruling out data/feature limitations), explicitly naming 'feature coverage' as an alternative to a ceiling on predictability itself. 
c4 substantively captures f6's reasoning that the ceiling may be a shared-input ceiling rather than intrinsic unpredictability."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f6"],"voidedMatches":[],"detectabilityDisputes":[{"flawId":"f2","assigned":"fulltext_or_external","auditor":"abstract_detectable","disposition":"kept_assigned","note":"Borderline: the data/feature-ceiling alternative needs the dataset to prove, though the abstract exposes the single-dataset/six-outcome antecedent. Detectability survived adversarial re-classification on 45 of 46 flaws; the one disputed borderline case is kept conservatively as full-text-only rather than reclassified to inflate the denominator."}],"rejectedFlags":[]}},{"slug":"reinhart-rogoff-growth-debt","targetTitle":"Growth in a Time of Debt","targetDoi":"10.1257/aer.100.2.573","openalexId":"https://openalex.org/W2157028860","aiRelated":false,"field":"Economics","expertSource":"Herndon, Ash & Pollin (2014), Cambridge Journal of Economics","expertCritique":"Replicating Reinhart and Rogoff's claim that public debt above 90% of GDP is associated with sharply lower growth, the authors find a spreadsheet coding error, selective exclusion of available country-year data, and unconventional weighting. Corrected, average real growth for high-debt countries is +2.2%, not the published -0.1%, eliminating the supposed debt threshold.","blindConcerns":[{"id":"c1","summary":"The abstract describes only a correlational threshold relationship yet the framing 'Growth in a Time of Debt' and the structure of the claims invite a causal reading (debt depresses growth) that the stated descriptive methods cannot support.","dimension":"claims","abstractBasis":"the relationship between government debt and real GDP growth is weak for debt/GDP ratios below a threshold of 90 percent of GDP. 
Above 90 percent, median growth rates fall","confidence":"high"},{"id":"c2","summary":"The 90 percent figure is presented as a sharp threshold, but the abstract gives no evidence that the cutoff was estimated rather than imposed, nor any statistical test (e.g. structural break, confidence interval) establishing that 90 percent is a genuine breakpoint rather than a chosen bin boundary.","dimension":"statistics","abstractBasis":"for debt/GDP ratios below a threshold of 90 percent of GDP. Above 90 percent, median growth rates fall by one percent","confidence":"high"},{"id":"c3","summary":"Findings rest on median and average growth rates by debt category without any reported measure of dispersion, statistical significance, or uncertainty, so the magnitude claims ('one percent', 'about two percent', 'cut in half') cannot be assessed for robustness.","dimension":"statistics","abstractBasis":"median growth rates fall by one percent, and average growth falls considerably more","confidence":"high"},{"id":"c4","summary":"The divergence between the median falling 'one percent' and the average falling 'considerably more' signals heavy skew or influential outliers in the high-debt group, which the abstract neither quantifies nor explains, leaving the reader unable to judge whether a few episodes drive the headline.","dimension":"statistics","abstractBasis":"median growth rates fall by one percent, and average growth falls considerably more","confidence":"medium"},{"id":"c5","summary":"Pooling roughly 3,700 annual observations across 44 countries and 200 years risks treating serially correlated, non-independent country-year observations as exchangeable, and the abstract reports no method for handling autocorrelation, country weighting, or panel structure.","dimension":"methods","abstractBasis":"over 3,700 annual observations covering a wide range of political systems, institutions, exchange rate arrangements, and historic 
circumstances","confidence":"medium"},{"id":"c6","summary":"The claim that the public-debt threshold is 'similar in advanced and emerging economies' is asserted without any reported comparison statistic or test of equality, so similarity cannot be distinguished from underpowered failure to detect a difference.","dimension":"claims","abstractBasis":"We find that the threshold for public debt is similar in advanced and emerging economies","confidence":"medium"},{"id":"c7","summary":"The reverse-causation possibility — that low or negative growth raises the debt/GDP ratio rather than high debt lowering growth — is a first-order threat to the inflection claim and is not acknowledged or addressed anywhere in the abstract.","dimension":"claims","abstractBasis":"Above 90 percent, median growth rates fall by one percent","confidence":"high"},{"id":"c8","summary":"Constructing comparable real GDP growth, inflation, and debt/GDP series across 44 countries over ~200 years involves substantial measurement and historical-data quality issues that the abstract does not discuss, despite 'new data' being central to the contribution.","dimension":"measurement","abstractBasis":"Our analysis is based on new data on forty-four countries spanning about two hundred years","confidence":"medium"},{"id":"c9","summary":"The inflation finding for advanced countries is immediately qualified by a counterexample (the US) within the same sentence, suggesting the 'no apparent contemporaneous link' conclusion may be sensitive to country composition and is at minimum internally tensioned.","dimension":"claims","abstractBasis":"there is no apparent contemporaneous link between inflation and public debt levels for the advanced countries as a group (some countries, such as the United States, have experienced higher inflation when debt/GDP is high)","confidence":"medium"},{"id":"c10","summary":"The abstract does not state whether the underlying dataset and code are available, which is material given 
that the central results are aggregations whose reproducibility depends on the specific data construction and category boundaries.","dimension":"data_code","abstractBasis":"Our analysis is based on new data on forty-four countries spanning about two hundred years","confidence":"low"}],"expertFlaws":[{"id":"f1","claim":"A spreadsheet (Excel) coding error caused several countries (e.g., Australia, Austria, Belgium, Canada, Denmark) to be silently omitted from the high-debt average, mechanically lowering the reported growth rate for the 90%+ debt category.","detectability":"fulltext_or_external","justification":"A literal coding/formula error in the authors' private spreadsheet cannot be inferred from the abstract. The abstract reports headline numbers (median growth falls 1%, average 'considerably more') but exposes nothing about the computation; the error was only provable by obtaining and inspecting the underlying workbook."},{"id":"f2","claim":"Selective exclusion of available country-year data (e.g., New Zealand 1946-49, early postwar Canada/Australia) dropped high-growth, high-debt episodes that would have offset the low-growth observations, biasing the high-debt average downward.","detectability":"fulltext_or_external","justification":"Which specific country-years were available but excluded is invisible from the abstract. The abstract advertises '3,700 annual observations' and 'forty-four countries' as comprehensive, but identifying that particular usable observations were omitted requires the full dataset and documentation, not the summary."},{"id":"f3","claim":"An unconventional weighting scheme (weighting each country equally rather than by the number of country-year observations, so a single year for one country counts as much as decades for another) distorted the high-debt growth estimate.","detectability":"fulltext_or_external","justification":"The abstract does not state how observations were aggregated into the per-category growth statistics. 
The choice between country-weighting and observation-weighting is a methodological detail disclosed (or discoverable) only in the full paper and data, not anticipatable from the abstract's wording."},{"id":"f4","claim":"Correcting all three problems raises average real growth for the high-debt (>90%) group to +2.2% from the published -0.1%, so the headline 90% threshold / cliff in growth does not exist.","detectability":"fulltext_or_external","justification":"The reversal of the central result is the OUTPUT of the reanalysis. It could only be established by re-running corrected computations on the data; nothing in the abstract lets a reader predict the magnitude or direction of the correction."},{"id":"f5","claim":"The headline causal/policy reading -- that crossing a 90% debt/GDP threshold sharply depresses growth -- rests on a discontinuity (median growth falls 1%, average falls much more above 90%) that the abstract itself presents as a sharp threshold despite an associational, observational design.","detectability":"abstract_detectable","justification":"The abstract itself frames a specific bright-line threshold (90%) at which 'median growth rates fall' and 'average growth falls considerably more' from purely cross-country/historical correlations. A careful reader can flag that a knife-edge threshold drawn from heterogeneous observational data, and the implied causal policy claim, is fragile to coding/sample/weighting choices -- exactly the robustness antecedent the replication exploited. 
The specific errors are not abstract-detectable, but the over-reliance on a single contested threshold statistic is."},{"id":"f6","claim":"Pooling forty-four heterogeneous countries across ~200 years with wildly unequal numbers of observations per country invites sensitivity to how unequal-length panels are averaged.","detectability":"abstract_detectable","justification":"The abstract explicitly advertises 'forty-four countries spanning about two hundred years' with '3,700 annual observations' across diverse institutions and regimes. A careful reader can anticipate that averaging such unbalanced panels is weighting-sensitive and that the reported category averages depend heavily on aggregation choices -- a legitimate robustness concern flagged from the abstract's own description, even though the actual mis-weighting was only confirmed in the data."}],"strict":{"judgements":[{"flawId":"f5","matched":true,"matchedConcernId":"c1","evidence":"Flaw f5 is that a sharp 90% threshold/discontinuity (median falls 1%, average falls much more) is over-relied upon and given a causal/policy reading despite an associational observational design. Concern c1 directly names this: 'a causal reading (debt depresses growth) that the stated descriptive methods cannot support', framed around the same threshold structure. This is the same underlying problem -- a contested threshold statistic carrying a causal/policy weight it cannot bear. Concerns c2 (threshold not estimated/tested) and c7 (reverse causation) also overlap, but c1 most directly matches the over-reliance-on-a-contested-causal-threshold core of f5."},{"flawId":"f6","matched":true,"matchedConcernId":"c5","evidence":"Flaw f6 is sensitivity to how unbalanced panels (44 countries, ~200 years, unequal observations per country) are averaged/weighted. 
Concern c5 names exactly this: pooling ~3,700 country-year observations 'risks treating serially correlated, non-independent country-year observations as exchangeable' and reports 'no method for handling autocorrelation, country weighting, or panel structure.' The explicit reference to country weighting and panel structure across the unbalanced 44-country/200-year panel substantively matches the weighting-sensitivity-of-unequal-length-panels problem in f6."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f5","matched":true,"matchedConcernId":"c1","evidence":"Flaw f5 concerns the headline causal/policy reading resting on a contested sharp 90% threshold from associational/observational data, fragile to coding/sample/weighting choices. Concern c1 directly identifies the over-reliance on a correlational threshold that invites an unsupported causal reading: 'a causal reading (debt depresses growth) that the stated descriptive methods cannot support.' This captures the same underlying weakness -- a causal/policy claim built on a correlational threshold statistic. Concerns c2 (whether 90% was estimated vs imposed, no breakpoint test) and c7 (reverse causation) reinforce the same fragility, but c1 most directly matches the over-reliance on the contested threshold statistic with an implied causal reading."},{"flawId":"f6","matched":true,"matchedConcernId":"c5","evidence":"Flaw f6 concerns pooling 44 heterogeneous countries across ~200 years with unequal observations per country, making category averages weighting/aggregation-sensitive. Concern c5 directly anticipates this: 'Pooling roughly 3,700 annual observations across 44 countries and 200 years... the abstract reports no method for handling autocorrelation, country weighting, or panel structure.' 
The explicit mention of 'country weighting' and panel structure for unbalanced data substantively matches the weighting-sensitivity of averaging unequal-length panels that f6 flags."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f6"],"voidedMatches":[],"detectabilityDisputes":[],"rejectedFlags":[]}},{"slug":"osc-reproducibility","targetTitle":"Estimating the reproducibility of psychological science","targetDoi":"10.1126/science.aac4716","openalexId":"https://openalex.org/W1897139626","aiRelated":false,"field":"Psychology (metascience)","expertSource":"Gilbert, King, Pettigrew & Wilson (2016), Science, 'Comment on Estimating the reproducibility of psychological science'","expertCritique":"Argues the Reproducibility Project contains three statistical errors (low-power replications, non-representative study sampling, and misleading endorsement criteria) that bias the reported replication rate downward. Concludes the data are actually consistent with very high reproducibility, not the low rate the original claimed.","blindConcerns":[{"id":"c1","summary":"The headline framing reports multiple discordant 'reproducibility' rates (36%, 47%, 39%, 68%) without the abstract committing to which is the operative estimate, so the single-number takeaway is underdetermined by the evidence presented.","dimension":"claims","abstractBasis":"Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed... 
68% with statistically significant effects","confidence":"high"},{"id":"c2","summary":"One of the five reported reproducibility metrics relies on a subjective human judgment, which the abstract does not characterize for inter-rater reliability or blinding, leaving that 39% figure of uncertain objectivity.","dimension":"measurement","abstractBasis":"39% of effects were subjectively rated to have replicated the original result","confidence":"medium"},{"id":"c3","summary":"The 68% figure is explicitly conditional on an untested assumption of no bias in original results, yet it is presented alongside the other rates as if comparably evidential, which risks readers treating an assumption-laden ceiling as an empirical finding.","dimension":"statistics","abstractBasis":"if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects","confidence":"high"},{"id":"c4","summary":"The sample is 100 studies drawn from only three psychology journals, but the title generalizes to 'the reproducibility of psychological science' broadly, a scope gap between the evidence base and the headline claim.","dimension":"generalisation","abstractBasis":"100 experimental and correlational studies published in three psychology journals; title: Estimating the reproducibility of psychological science","confidence":"high"},{"id":"c5","summary":"The abstract does not state how the 100 studies were selected from the candidate pool in the three journals, so whether the estimate is representative or subject to selection effects cannot be assessed from the abstract alone.","dimension":"methods","abstractBasis":"We conducted replications of 100 experimental and correlational studies published in three psychology journals","confidence":"medium"},{"id":"c6","summary":"Using statistical significance of the replication as a reproducibility criterion conflates effect existence with the dichotomous p<.05 outcome, and the abstract reports 
significance counts (97% vs 36%) without acknowledging that low replication power or smaller true effects, rather than non-reproducibility, could drive the gap.","dimension":"statistics","abstractBasis":"Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results","confidence":"medium"},{"id":"c7","summary":"'Original materials when available' implies some replications used non-original materials, a heterogeneity that could systematically depress observed replication and is not quantified in the abstract.","dimension":"methods","abstractBasis":"using high-powered designs and original materials when available","confidence":"medium"},{"id":"c8","summary":"The causal-sounding claim that replication success was 'better predicted by the strength of original evidence' rests on correlational tests, and the abstract does not report effect sizes or controls, so the comparative predictor claim is asserted more strongly than the stated method supports.","dimension":"overclaiming","abstractBasis":"Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams","confidence":"medium"},{"id":"c9","summary":"The claim that replication effects represent a 'substantial decline' interprets a halving of effect magnitude as decline, but regression to the mean and publication-driven inflation of originals are plausible alternative explanations the abstract does not address.","dimension":"claims","abstractBasis":"Replication effects were half the magnitude of original effects, representing a substantial decline","confidence":"medium"}],"expertFlaws":[{"id":"f1","claim":"Many replications were themselves underpowered, so a non-significant replication is expected by sampling error even when the original effect is real; the headline 'low replication rate' conflates true failure to replicate 
with the replications' own statistical limitations.","detectability":"abstract_detectable","justification":"The abstract foregrounds 'high-powered designs' as a virtue and reports the 36% significant-replication rate as its central criterion. A careful reader can interrogate this: significance-counting depends entirely on replication power, and the abstract gives no per-study power evidence, only the aspirational descriptor. The antecedent risk that the dichotomous significance criterion is sensitive to replication power is visible in the abstract's own framing. (Gilbert et al.'s specific quantitative demonstration of fidelity-related power loss is external, but the power-dependence concern is anticipatable.)"},{"id":"f2","claim":"The 100 studies were not a representative random sample of the psychology literature; studies were selected under feasibility and assignment constraints, so the aggregate 'reproducibility' estimate cannot be generalized to the field as the abstract implies.","detectability":"abstract_detectable","justification":"The abstract draws a population-level inference ('the extent to which [reproducibility] characterizes current research') from '100 ... studies published in three psychology journals.' 
The leap from a convenience set of 100 studies in three journals to a claim about 'current research' is an over-generalisation visible in the abstract's own wording, and the absence of any stated random-sampling procedure is itself the flaggable antecedent."},{"id":"f3","claim":"The 'subjective replication' endorsement criterion (and the family of dichotomous success metrics) is misleading: subjective ratings and significance-counting are noisy, low-reliability gauges that understate agreement between original and replication; better metrics (e.g., whether the original effect falls in the replication CI) suggest much higher reproducibility.","detectability":"abstract_detectable","justification":"The abstract itself lists five discordant metrics side by side (36% significant, 47% in CI, 39% subjectively rated, 68% combined), exposing that the 'low reproducibility' headline depends on which criterion is chosen. A careful reader can flag that the conclusion is criterion-dependent and that the subjective-rating and significance criteria are contestable choices that drive the pessimistic number. The headline's reliance on a contested choice of metric is self-evident from the abstract's own spread of figures."},{"id":"f4","claim":"Comparing the magnitude of replication effects to original effects to infer 'a substantial decline' ignores regression to the mean and selection/publication bias inflating originals, so the halving of effect sizes does not by itself index irreproducibility.","detectability":"abstract_detectable","justification":"The abstract reports '97% of original studies had statistically significant results' alongside the claim that replication effects were 'half the magnitude,' framed as 'a substantial decline.' 
A methodologist reading only the abstract can note that near-universal original significance signals selection/publication bias and that effect-size shrinkage is the expected consequence of regression to the mean, making the causal 'decline' language an over-reach detectable from the abstract's own numbers."},{"id":"f5","claim":"Replications used 'original materials when available' (i.e., not always), and protocol-fidelity differences (population, setting, procedure) plausibly account for non-replications rather than the originals being false positives.","detectability":"abstract_detectable","justification":"The abstract's own hedge 'when available' concedes that some replications departed from original materials. A careful reader can flag protocol infidelity as an alternative explanation for low replication that the abstract does not rule out. The specific demonstration that low-fidelity replications failed more often required the full data (external), but the antecedent risk is exposed by the abstract's wording."},{"id":"f6","claim":"Gilbert et al.'s reanalysis using a high-fidelity benchmark (independent direct-replication confidence intervals) shows the observed replication rate is statistically consistent with ~100% true reproducibility once expected sampling error and infidelity are accounted for.","detectability":"fulltext_or_external","justification":"This positive counter-claim required external data (the Many Labs project's repeated-replication estimates of expected agreement) and a quantitative reanalysis the original paper did not contain and the abstract could not reveal. 
It is only establishable by bringing in outside information, not by reading the abstract."},{"id":"f7","claim":"Endorsement/CI-overlap metrics were not adjusted for the error rate of the replication estimates themselves, so the count of 'replications whose CI excludes the original effect' overstates failures because some exclusions arise purely from replication noise.","detectability":"fulltext_or_external","justification":"Diagnosing how the CI-based concordance counts were computed and showing they fail to account for the replications' own measurement error requires inspecting the project's per-study statistics and the metric definitions in the full paper/appendices, not just the abstract's summary percentages."}],"strict":{"judgements":[{"flawId":"f1","matched":true,"matchedConcernId":"c6","evidence":"c6 states that 'using statistical significance of the replication as a reproducibility criterion conflates effect existence with the dichotomous p<.05 outcome' and that 'low replication power or smaller true effects, rather than non-reproducibility, could drive the gap.' This substantively identifies the same flaw as f1: that a non-significant replication is expected by sampling error / power limitations even when the original effect is real. c6 explicitly names replication power as the alternative explanation for the 97% vs 36% gap, matching f1's core point."},{"flawId":"f2","matched":true,"matchedConcernId":"c5","evidence":"f2 concerns non-representative selection of the 100 studies undermining generalization. c5 directly identifies this: 'The abstract does not state how the 100 studies were selected from the candidate pool... so whether the estimate is representative or subject to selection effects cannot be assessed.' This is the same methodological problem (selection/non-random sampling). 
c4 is the related but distinct scope/generalization-of-title issue; c5 is the more precise match to f2's sampling-representativeness claim."},{"flawId":"f3","matched":true,"matchedConcernId":"c1","evidence":"f3's core is that the 'low reproducibility' headline is criterion-dependent — multiple discordant metrics yield different conclusions, and the pessimistic number depends on contested choices (subjective rating, significance-counting). c1 substantively captures this: it flags 'multiple discordant reproducibility rates (36%, 47%, 39%, 68%) without the abstract committing to which is the operative estimate, so the single-number takeaway is underdetermined.' This identifies the same criterion-dependence problem. c2 (subjective rating reliability) is a narrower component but c1 names the central spread-of-metrics flaw f3 describes."},{"flawId":"f4","matched":true,"matchedConcernId":"c9","evidence":"f4 states that inferring 'substantial decline' from halved effect sizes ignores regression to the mean and publication/selection bias inflating originals. c9 is a near-verbatim substantive match: 'The claim that replication effects represent a substantial decline interprets a halving of effect magnitude as decline, but regression to the mean and publication-driven inflation of originals are plausible alternative explanations.' Same alternative explanations, same flaw."},{"flawId":"f5","matched":true,"matchedConcernId":"c7","evidence":"f5 concerns 'original materials when available' implying protocol infidelity as an alternative explanation for non-replication. 
c7 substantively matches: \"'Original materials when available' implies some replications used non-original materials, a heterogeneity that could systematically depress observed replication.\" Both flag the 'when available' hedge and protocol/material infidelity as an unquantified alternative explanation for low replication."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f1","matched":true,"matchedConcernId":"c6","evidence":"c6 states: 'low replication power or smaller true effects, rather than non-reproducibility, could drive the gap' and notes significance-counting 'conflates effect existence with the dichotomous p<.05 outcome.' This substantively identifies the same underlying problem as f1: that a non-significant replication is expected under low replication power even when the original effect is real, so the significance-based 'low replication rate' conflates true failure with the replications' own statistical limitations. The power-dependence of the dichotomous criterion is the shared core."},{"flawId":"f2","matched":true,"matchedConcernId":"c5","evidence":"c5 directly flags that 'the abstract does not state how the 100 studies were selected from the candidate pool... so whether the estimate is representative or subject to selection effects cannot be assessed.' This is the same flaw as f2 (no representative random sample; estimate cannot be generalized to the field). c4 also touches the generalisation gap but frames it as journal-scope; c5 is the substantive match on the sampling/representativeness problem that f2 centers on."},{"flawId":"f3","matched":true,"matchedConcernId":"c1","evidence":"c1 identifies that 'multiple discordant reproducibility rates (36%, 47%, 39%, 68%)' are reported and 'the single-number takeaway is underdetermined,' and c2 separately flags the subjective 39% metric's uncertain objectivity (no inter-rater reliability/blinding). 
Together with the abstract's own spread, c1 substantively matches f3's core claim that the 'low reproducibility' headline is criterion-dependent and that dichotomous/subjective metrics are contestable choices driving the pessimistic number. c1 is the strongest match on the criterion-dependence point."},{"flawId":"f4","matched":true,"matchedConcernId":"c9","evidence":"c9 states the 'substantial decline' interpretation of halved effect magnitude has 'regression to the mean and publication-driven inflation of originals' as 'plausible alternative explanations the abstract does not address.' This is essentially verbatim the same flaw as f4: effect-size halving does not by itself index irreproducibility because regression to the mean and selection/publication bias inflate originals."},{"flawId":"f5","matched":true,"matchedConcernId":"c7","evidence":"c7 flags that \"'original materials when available' implies some replications used non-original materials, a heterogeneity that could systematically depress observed replication.\" This substantively matches f5: protocol/materials infidelity is an alternative explanation for non-replication rather than the originals being false positives. Both seize on the 'when available' hedge as exposing protocol infidelity as an unruled-out confound."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f3"],"voidedMatches":[],"detectabilityDisputes":[],"rejectedFlags":[{"concernId":"c4","kind":"ungrounded","note":"False alarm: the blindness auditor saw only the abstract, but the engine also had the title. c4 critiques the title (\"the reproducibility of psychological science\") for over-generalising a three-journal sample — a legitimate, title-grounded overclaiming critique, not fabricated rigour. 
Flag rejected."}]}},{"slug":"ego-depletion-original","targetTitle":"Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers","targetDoi":"10.1177/0956797614526415","openalexId":"https://openalex.org/W2055799459","aiRelated":false,"field":"Psychology / neuroscience","expertSource":"Hagger et al. (2016), Registered Replication Report, Perspectives on Psychological Science","expertCritique":"A preregistered Registered Replication Report across 23 labs (~2000 participants) tested the sequential-task ego-depletion effect and found a meta-analytic effect indistinguishable from zero (d ~ 0.04), so the regulatory-depletion phenomenon this paper presumes and builds on is not robust.","blindConcerns":[{"id":"c1","summary":"The headline claim of \"fully blocks\" effort-induced depletion is an absolute, all-or-nothing assertion that is rarely warranted from a single study and invites overinterpretation of what may be a partial or noisy attenuation effect.","dimension":"overclaiming","abstractBasis":"methylphenidate (commonly known as Ritalin) ... fully blocks effort-induced depletion of regulatory control","confidence":"high"},{"id":"c2","summary":"The abstract reports no sample size, effect size, or significance values, so the reader cannot judge whether the \"fully blocks\" conclusion rests on adequate statistical power or is an underpowered single-study finding.","dimension":"statistics","abstractBasis":"Using a placebo-controlled, double-blind design, we demonstrated that the psychostimulant methylphenidate ... 
fully blocks effort-induced depletion","confidence":"high"},{"id":"c3","summary":"Claiming \"specificity\" of the drug effect to a single frequency band requires demonstrating null effects in other bands; absent reported comparisons or correction for multiple comparisons across bands, a single significant band may reflect selective reporting rather than genuine specificity.","dimension":"statistics","abstractBasis":"Spectral analysis of trial-by-trial reaction times revealed specificity of methylphenidate effects on regulatory depletion in the slow-4 frequency band","confidence":"high"},{"id":"c4","summary":"The leap from a behavioral RT spectral signature to claims about resting-state brain networks and mind wandering is a reverse-inference: an association of a band with networks elsewhere does not establish that those networks drove this effect, and no neuroimaging is described.","dimension":"claims","abstractBasis":"This band is associated with the operation of resting-state brain networks that produce mind wandering, which raises potential connections between our results and recent brain-network-based models","confidence":"high"},{"id":"c5","summary":"The mechanistic framing attributes the effect to dopamine and norepinephrine increases, but a behavioral/RT study cannot isolate which catecholamine (or neither) is responsible, so the synaptic mechanism is asserted rather than tested.","dimension":"overclaiming","abstractBasis":"a catecholamine reuptake blocker that increases dopamine and norepinephrine at the synaptic cleft, fully blocks effort-induced depletion","confidence":"medium"},{"id":"c6","summary":"Findings in healthy volunteers are framed as relevant to everyday problems and psychiatric conditions, but the abstract provides no clinical or applied data to support that generalization.","dimension":"generalisation","abstractBasis":"in Healthy Volunteers ... 
Regulatory depletion is thought to play an important role in everyday problems (e.g., excessive spending, overeating) as well as psychiatric conditions","confidence":"medium"},{"id":"c7","summary":"The abstract presents ego-depletion as an established phenomenon by citing volume of prior studies, but the cited literature wave is precisely the body whose replicability has been widely contested, so the depletion premise the whole study rests on is treated as settled when it is not.","dimension":"claims","abstractBasis":"A recent wave of studies--more than 100 conducted over the last decade--has shown that exerting effort ... leaves a person depleted","confidence":"medium"},{"id":"c8","summary":"Key operational details (the regulation/depletion task, dose, timing, number of trials, outcome measure) are absent from the abstract, limiting independent appraisal and reproducibility of the central claim.","dimension":"reproducibility","abstractBasis":"we demonstrated that the psychostimulant methylphenidate ... fully blocks effort-induced depletion of regulatory control","confidence":"medium"},{"id":"c9","summary":"The interpretive link to mind-wandering networks is hedged as merely \"potential,\" yet it is foregrounded in the abstract's conclusion, risking a speculative connection being read as a finding.","dimension":"overclaiming","abstractBasis":"which raises potential connections between our results and recent brain-network-based models of control over attention","confidence":"low"}],"expertFlaws":[{"id":"f1","claim":"The foundational ego-depletion (sequential-task regulatory depletion) effect that the paper presumes is not robust: a 23-lab preregistered replication (~2000 participants) found a meta-analytic effect indistinguishable from zero (d ~ 0.04).","detectability":"fulltext_or_external","justification":"The non-robustness of the depletion effect was established only by a large external multi-lab replication conducted after the fact. 
No reader of this abstract could know in advance that the meta-analytic effect would collapse to ~zero; this required running new data across 23 labs. The abstract presents depletion as established ('more than 100 studies'), and the failure is an empirical fact about the broader literature, not derivable from this abstract's wording."},{"id":"f2","claim":"The paper treats regulatory depletion as a settled, well-replicated phenomenon ('more than 100 studies over the last decade') and builds its entire claim on that premise, rather than treating it as a contested effect requiring independent verification.","detectability":"abstract_detectable","justification":"The abstract itself foregrounds reliance on a contested phenomenon: it presupposes depletion exists and frames the study as explaining its 'neurophysiological basis.' A careful reader can flag that the central claim is entirely contingent on the reality of an effect the paper does not itself establish, and that a citation-count appeal ('more than 100 studies') is not evidence of robustness. The dependence on a presumed effect is visible in the abstract's own wording, even if the specific replication outcome is not."},{"id":"f3","claim":"If the baseline depletion effect is itself absent/near-zero, there is no genuine depletion for methylphenidate to 'block,' so the headline causal claim that the drug reverses a real regulatory-control deficit is unsupported.","detectability":"abstract_detectable","justification":"This is a logical/internal-consistency concern derivable from the abstract: the claim that a drug 'fully blocks' depletion presupposes a reliable depletion effect to begin with. A careful reader can note the strong causal headline rests on a manipulation whose reliability is assumed, not demonstrated, so the interpretation is conditional on a fragile antecedent. 
The reader cannot know the antecedent fails (that is f1), but can flag that the headline's validity is fully hostage to it."},{"id":"f4","claim":"The study is a single-lab demonstration whose central effect (a drug-by-depletion interaction) is reported with strong, definitive language ('fully blocks', 'demonstrated specificity') without independent replication.","detectability":"abstract_detectable","justification":"The abstract uses unhedged, definitive causal wording ('fully blocks effort-induced depletion', 'demonstrated... specificity') for a single study. A careful reader can legitimately flag over-strong claiming plus single-lab robustness risk as antecedent concerns, which is exactly the vulnerability the large multi-lab RRR exploited. This is the detectable methodological antecedent (single demonstration plus strong claim) phrased as a robustness risk, not a proven failure."},{"id":"f5","claim":"The secondary mechanistic finding (spectral analysis localizing methylphenidate effects to the slow-4 band, linked to mind-wandering brain networks) is an exploratory, post-hoc neurophysiological interpretation layered on top of the unverified depletion effect.","detectability":"abstract_detectable","justification":"The abstract's own language ('raises potential connections', frequency-band specificity tied to 'recent brain-network-based models') signals a speculative, post-hoc interpretive layer. A careful reader can flag that this mechanistic story inherits the fragility of the underlying depletion claim and adds further inferential distance (a specific frequency band, a specific network theory) without confirmation. 
Its speculative status is visible in the abstract, though the underlying phenomenon's failure (f1) is not."}],"strict":{"judgements":[{"flawId":"f2","matched":true,"matchedConcernId":"c7","evidence":"c7 states the abstract 'presents ego-depletion as an established phenomenon by citing volume of prior studies, but the cited literature wave is precisely the body whose replicability has been widely contested, so the depletion premise the whole study rests on is treated as settled when it is not.' This substantively matches f2's core point: the paper treats regulatory depletion as settled/well-replicated ('more than 100 studies') and builds its entire claim on that contested premise. Both identify the citation-count appeal and the unwarranted treatment of a contested effect as settled."},{"flawId":"f3","matched":false,"matchedConcernId":"","evidence":"f3 is a specific internal-consistency/logical concern: if the baseline depletion effect is absent/near-zero, there is nothing for the drug to 'block,' so the headline causal claim is hostage to a fragile antecedent. No blind concern makes this conditional-dependency argument. c7 questions whether depletion is established, and c1 questions the 'fully blocks' overclaim, but neither articulates that the drug's blocking claim logically presupposes and is wholly contingent on a reliable depletion effect existing. The specific 'nothing to block if no depletion' logic is absent."},{"flawId":"f4","matched":true,"matchedConcernId":"c2","evidence":"f4 concerns a single-lab demonstration reported with definitive language ('fully blocks', 'demonstrated specificity') lacking independent replication. c2 flags that 'the reader cannot judge whether the \"fully blocks\" conclusion rests on adequate statistical power or is an underpowered single-study finding' — directly naming the single-study robustness risk. c1 separately flags the overclaiming language. 
c2 most directly captures the single-lab/replication robustness vulnerability that f4 describes, framing it as an underpowered single-study concern."},{"flawId":"f5","matched":true,"matchedConcernId":"c4","evidence":"f5 describes the spectral/slow-4 mind-wandering-network finding as an exploratory, post-hoc, speculative neurophysiological interpretation layered on the study. c4 identifies this as a 'reverse-inference' where 'an association of a band with networks elsewhere does not establish that those networks drove this effect, and no neuroimaging is described,' and c9 flags the 'potential connections' as speculative being foregrounded as a finding. c4 substantively captures the speculative/post-hoc inferential-distance problem with the brain-network interpretation that f5 describes."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f2","matched":true,"matchedConcernId":"c7","evidence":"c7 states the abstract 'presents ego-depletion as an established phenomenon by citing volume of prior studies, but the cited literature wave is precisely the body whose replicability has been widely contested, so the depletion premise the whole study rests on is treated as settled when it is not.' This substantively matches f2's claim that the paper treats regulatory depletion as a settled, well-replicated phenomenon ('more than 100 studies') and builds its entire claim on that premise rather than treating it as contested. Both identify the citation-count appeal and the unwarranted treatment of a contested effect as settled."},{"flawId":"f3","matched":false,"matchedConcernId":"","evidence":"f3 is a specific logical/internal-consistency point: if the baseline depletion effect is itself absent/near-zero, there is no genuine depletion to 'block,' so the headline causal claim is unsupported. No blind concern makes this conditional-dependency argument. 
c1 challenges 'fully blocks' as overclaiming an all-or-nothing effect, and c7 challenges the depletion premise as contested, but neither articulates the specific logical point that the 'blocks' claim is hostage to the antecedent reliability of the depletion effect (i.e., that there must be a real effect to block). The closest is c7, but it argues about replicability/settledness generally, not the internal-consistency dependency that the headline is void if the antecedent fails."},{"flawId":"f4","matched":true,"matchedConcernId":"c2","evidence":"f4 flags a single-lab demonstration reported with strong, definitive language ('fully blocks', 'demonstrated specificity') without independent replication. c2 captures the single-study robustness risk directly: 'whether the \"fully blocks\" conclusion rests on adequate statistical power or is an underpowered single-study finding.' Combined with c1's challenge to the unhedged 'fully blocks' absolute claiming, the core of f4 (single demonstration + over-strong claiming = robustness risk) is substantively identified. c2 names the single-study robustness vulnerability that is the heart of f4."},{"flawId":"f5","matched":true,"matchedConcernId":"c4","evidence":"f5 flags the spectral/slow-4/mind-wandering-network finding as an exploratory, post-hoc neurophysiological interpretation layered on the unverified depletion effect. c4 substantively identifies this: it flags the 'leap from a behavioral RT spectral signature to claims about resting-state brain networks and mind wandering' as a reverse-inference with no neuroimaging, and c9 notes the 'potential' connection is foregrounded as if a finding. 
c4 captures the speculative, inferentially-distant nature of the mechanistic/network interpretation that is the core of f5."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f4"],"voidedMatches":[{"flawId":"f2","concernId":"c7","reason":"Leakage: concern c7 asserts the depletion literature's \"replicability has been widely contested\" — outside replication-crisis knowledge a blind abstract-only reader cannot have. The honest abstract-only point (the paper assumes its premise) is fine, but the credited match leaned on the leaked claim, so it is voided. The match auditor upheld f2<-c7; the independent blindness auditor caught the contamination."}],"detectabilityDisputes":[],"rejectedFlags":[{"concernId":"c6","kind":"ungrounded","note":"False alarm: \"healthy volunteers\" appears in the title (which the engine had) though not the abstract body. Title-grounded, not fabricated. c6 was not a credited match. Flag rejected."}]}},{"slug":"power-posing-original","targetTitle":"Power Posing","targetDoi":"10.1177/0956797610383437","openalexId":"https://openalex.org/W2152928398","aiRelated":false,"field":"Social psychology","expertSource":"Ranehill et al. (2015), Psychological Science (large replication)","expertCritique":"A larger, better-powered replication (N=200) replicated only self-reported feelings of power but found no effect of expansive postures on testosterone, cortisol, or behavioral risk tolerance. 
Challenges the central physiological and behavioral claims of the original power-posing paper.","blindConcerns":[{"id":"c1","summary":"The abstract reports a single study yet draws broad causal physiological and behavioral conclusions, with no sample size disclosed, leaving statistical power and reliability unverifiable.","dimension":"statistics","abstractBasis":"The results of this study confirmed our prediction","confidence":"high"},{"id":"c2","summary":"The headline causal claim that posing 'caused' neuroendocrine changes rests on a hormonal mechanism (testosterone up, cortisol down) that is notoriously noisy and confound-prone, but the abstract reports only directional effects with no effect sizes, baselines, or measurement detail.","dimension":"measurement","abstractBasis":"High-power posers experienced elevations in testosterone, decreases in cortisol","confidence":"high"},{"id":"c3","summary":"The abstract makes an unqualified causal claim from what appears to be a two-condition comparison without mentioning a neutral control, randomization, blinding, or pre-registration, so the inference from correlation-in-an-experiment to 'cause' is asserted rather than substantiated in the text.","dimension":"methods","abstractBasis":"But can these postures actually cause power? ... 
posing in displays of power caused advantaged and adaptive psychological, physiological, and behavioral changes","confidence":"high"},{"id":"c4","summary":"The closing sentence extrapolates from a lab manipulation to sweeping real-world prescriptions, treating a 1-minute pose as something that makes a person 'instantly become more powerful' — a generalization the within-study evidence described cannot support.","dimension":"overclaiming","abstractBasis":"That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.","confidence":"high"},{"id":"c5","summary":"Multiple distinct outcomes (testosterone, cortisol, felt power, risk tolerance) are reported as all confirming the prediction, raising concern about multiple comparisons and selective emphasis, yet the abstract gives no correction, p-values, or indication of which effects were primary versus exploratory.","dimension":"statistics","abstractBasis":"elevations in testosterone, decreases in cortisol, and increased feelings of power and tolerance for risk","confidence":"medium"},{"id":"c6","summary":"The abstract presents results entirely as confirmation of the authors' prior prediction without any mention of limitations, null findings, or boundary conditions, a confirmatory framing that warrants caution about possible confirmation bias.","dimension":"claims","abstractBasis":"The results of this study confirmed our prediction","confidence":"medium"},{"id":"c7","summary":"No information is given on replication, data, or code availability, so the strong actionable claims rest on a single unreplicated study as described in the abstract.","dimension":"reproducibility","abstractBasis":"The results of this study confirmed our prediction","confidence":"medium"},{"id":"c8","summary":"The terms 'advantaged and adaptive' and 'more powerful' conflate a transient self-report and hormone shift with durable real-world power and advantage, a 
measurement-to-construct leap not justified by the outcomes named.","dimension":"measurement","abstractBasis":"posing in displays of power caused advantaged and adaptive psychological, physiological, and behavioral changes","confidence":"medium"}],"expertFlaws":[{"id":"f1","claim":"The original study's neuroendocrine and behavioral effects (testosterone increase, cortisol decrease, risk tolerance) fail to replicate in a larger, better-powered sample (N=200), so the physiological and behavioral causal claims are not robust.","detectability":"fulltext_or_external","justification":"That these specific effects do NOT replicate is a fact established only by running a new, larger study and observing null results. The abstract reports positive findings with no indication they would vanish; a reader of only the abstract cannot know the directional hormone/behavior effects will not hold up in independent data."},{"id":"f2","claim":"The original study was underpowered: its sample was too small to reliably estimate the claimed hormonal and behavioral effects, making them vulnerable to false positives.","detectability":"abstract_detectable","justification":"The replication's headline contrast is 'larger, better-powered (N=200).' 
The original abstract never states a sample size, but a careful reader can flag that detecting neuroendocrine effects (testosterone/cortisol) across both sexes with two 1-min poses is a demanding inference; the absence of any reported N and the multiplicity of outcomes (hormonal + psychological + behavioral) are visible robustness/power risks anticipatable from the abstract's own scope, even though the exact N and power deficit require the full paper."},{"id":"f3","claim":"Only the self-reported 'feelings of power' effect replicates; the subjective self-report outcome is the weakest, most demand-susceptible measure and cannot license the physiological/behavioral conclusions.","detectability":"abstract_detectable","justification":"The abstract itself lists 'increased feelings of power' alongside the objective measures and frames a sweeping conclusion ('embody power and instantly become more powerful'). A careful reader can note that the self-report measure is susceptible to demand characteristics and that the strong causal headline rests partly on a subjective outcome — a legitimate concern from the abstract alone, even though knowing it is the ONLY surviving effect requires the replication."},{"id":"f4","claim":"The abstract over-generalizes to 'real-world, actionable implications' from a single laboratory study, asserting instant, universal effects not warranted by the evidence base.","detectability":"abstract_detectable","justification":"The over-claiming ('instantly become more powerful,' 'real-world, actionable implications,' 'embodiment extends... to physiology and subsequent behavioral choices') is visible in the abstract's own wording. A reader can flag that strong, generalizing, prescriptive language from one study outruns the demonstrated evidence — independent of the replication's null results."},{"id":"f5","claim":"The strong causal claim ('can these postures actually cause power?... posing... caused... 
changes') rests on a single original study without independent confirmation, a contested-phenomenon risk.","detectability":"abstract_detectable","justification":"The abstract foregrounds a bold causal question and answers it affirmatively from one study. Reliance on a single, not-yet-independently-confirmed demonstration for a strong causal claim is a robustness risk a careful reader can name from the abstract alone, even though the actual failure to confirm is only shown by the replication."},{"id":"f6","claim":"Cortisol and testosterone responses are noisy, assay- and timing-sensitive measures whose claimed clean directional effects may reflect uncontrolled measurement/analytic variability rather than a true posture effect.","detectability":"fulltext_or_external","justification":"Assessing the specific measurement, assay, timing, and analytic-flexibility issues behind the hormone results requires the full method section and ideally reanalysis; the abstract gives no methodological detail to diagnose this, so it is not anticipatable from the abstract alone beyond the generic power concern already in f2."}],"strict":{"judgements":[{"flawId":"f2","matched":true,"matchedConcernId":"c1","evidence":"f2 is the underpowered/too-small-sample flaw making effects vulnerable to false positives. c1 states 'no sample size disclosed, leaving statistical power and reliability unverifiable' and explicitly cites the multiplicity of broad conclusions from a single study. This substantively names the same power/sample-size concern. c5 (multiplicity) is adjacent but is about multiple comparisons rather than insufficient N; c1 is the direct power match."},{"flawId":"f3","matched":false,"matchedConcernId":"","evidence":"f3's specific problem is that the self-reported 'feelings of power' outcome is the weakest, most demand-characteristic-susceptible measure and cannot license the physiological/behavioral conclusions. 
The closest blind concern, c8, claims 'advantaged and adaptive' and 'more powerful' conflate transient self-report and hormone shift with durable power - a construct-validity/overclaiming point, not a demand-characteristics critique singling out the self-report measure as weakest. No concern names demand characteristics or isolates the subjective measure's vulnerability. c2 focuses on hormonal noise, not self-report. No substantive match."},{"flawId":"f4","matched":true,"matchedConcernId":"c4","evidence":"f4 is over-generalization from a single lab study to 'real-world, actionable implications' and instant/universal effects. c4 directly states the closing sentence 'extrapolates from a lab manipulation to sweeping real-world prescriptions, treating a 1-min pose as something that makes a person instantly become more powerful - a generalization the within-study evidence cannot support,' quoting the same abstract language. Clear substantive match."},{"flawId":"f5","matched":true,"matchedConcernId":"c7","evidence":"f5 is the contested-phenomenon/robustness risk: a strong causal claim resting on a single original study without independent confirmation. c7 states 'the strong actionable claims rest on a single unreplicated study as described in the abstract,' naming reliance on one unreplicated study for strong claims - the same single-study-confirmation robustness problem. c3 addresses causal inference from a two-condition design (internal validity), which is a different problem; c7 is the direct match to the single-study/replication risk."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f2","matched":true,"matchedConcernId":"c1","evidence":"Concern c1 states 'no sample size disclosed, leaving statistical power and reliability unverifiable' and flags 'broad causal physiological and behavioral conclusions' from a single study. 
This substantively matches f2's underpowered/too-small-sample concern: both point at the absence of any reported N undermining the reliability of the claimed hormonal and behavioral effects. c5 reinforces this on the multiplicity-of-outcomes angle that f2's justification also cites, but c1 most directly names the power/sample-size weakness."},{"flawId":"f3","matched":true,"matchedConcernId":"c8","evidence":"f3's abstract-detectable core is that the self-report 'feelings of power' is the weakest, demand-susceptible measure that cannot license the physiological/behavioral conclusions. c8 captures the construct-validity half ('conflate a transient self-report and hormone shift with durable real-world power... a measurement-to-construct leap'), flagging the self-report outcome as inadequate to support the sweeping 'more powerful' conclusion. While c8 does not use the words 'demand characteristics,' it substantively identifies the same weakness: the self-report measure being over-leveraged into conclusions it cannot bear. This is a substantive (not merely topical) match on the self-report-over-reach problem."},{"flawId":"f4","matched":true,"matchedConcernId":"c4","evidence":"c4 states the closing sentence 'extrapolates from a lab manipulation to sweeping real-world prescriptions,' that a 1-min pose 'instantly become more powerful' is 'a generalization the within-study evidence described cannot support.' This is essentially identical to f4's over-generalization to 'real-world, actionable implications' from a single lab study asserting instant, universal effects. Clear substantive match."},{"flawId":"f5","matched":true,"matchedConcernId":"c3","evidence":"f5 concerns the strong causal claim ('can these postures actually cause power?... caused... changes') resting on a single original study without independent confirmation. c3 quotes the same causal language and argues 'the unqualified causal claim... 
the inference from correlation-in-an-experiment to cause is asserted rather than substantiated,' flagging absence of control/randomization/replication. c7 also notes 'strong actionable claims rest on a single unreplicated study.' c3 most directly targets the causal-inference robustness risk f5 names, making this a substantive match on the contested-causal-claim weakness."}],"overclaimedConcerns":[{"concernId":"c6","why":"c6 asserts a 'confirmation bias' defect inferred purely from the phrase 'confirmed our prediction' and the absence of stated limitations in an abstract. Abstracts routinely omit limitations and state whether predictions were confirmed; this is normal abstract convention, not evidence of biased analysis. The claim of likely confirmation bias has no legitimate basis in the abstract text itself."}]},"audit":{"overturnedFlawIds":[],"voidedMatches":[],"detectabilityDisputes":[],"rejectedFlags":[]}},{"slug":"facial-feedback-original","targetTitle":"Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis.","targetDoi":"10.1037/0022-3514.54.5.768","openalexId":"https://openalex.org/W2169284121","aiRelated":false,"field":"Social psychology","expertSource":"Wagenmakers et al. (2016), Registered Replication Report, Perspectives on Psychological Science","expertCritique":"A Registered Replication Report of 17 direct replications found a pooled effect of 0.03 rating units (95% CI -0.11 to 0.16) versus the original 0.82, failing to replicate the claim that induced smiling increases rated funniness.","blindConcerns":[{"id":"c1","summary":"The headline causal claim that facial activity 'influences' affective responses rests on a between-conditions difference in self-reported humor ratings, with no effect size, sample size, or significance information disclosed in the abstract.","dimension":"statistics","abstractBasis":"Subjects reported more intense humor responses ... 
under facilitating conditions than under inhibiting conditions","confidence":"high"},{"id":"c2","summary":"The sole outcome appears to be a subjective self-report ('reported more intense humor responses'), so the affective effect is measured by demand-sensitive ratings rather than any independent behavioral or physiological index of emotion.","dimension":"measurement","abstractBasis":"Subjects reported more intense humor responses when cartoons were presented","confidence":"high"},{"id":"c3","summary":"The claim that the manipulation 'precluded labeling of the facial expression in emotion categories' and was 'nonobtrusive' is asserted as achieved but the abstract gives no evidence (e.g., manipulation check, suspicion probe) demonstrating subjects were actually unaware of the smiling-related purpose.","dimension":"methods","abstractBasis":"inhibiting conditions that precluded labeling of the facial expression in emotion categories","confidence":"medium"},{"id":"c4","summary":"Study 2 is described as a replication but is conducted by the same authors within the same paper using the same paradigm; an internal conceptual replication does not establish independent reproducibility.","dimension":"reproducibility","abstractBasis":"The results replicated Study 1's findings","confidence":"medium"},{"id":"c5","summary":"The dissociation claim that 'facial feedback operates on the affective but not on the cognitive component' is a claim of a null effect on cognition, which the abstract presents as a positive finding without indicating it was powered to detect such an absence.","dimension":"claims","abstractBasis":"facial feedback operates on the affective but not on the cognitive component of the humor response","confidence":"high"},{"id":"c6","summary":"The closing claim that 'both inhibitory and facilitatory mechanisms may have contributed' is hedged ('may have') and apparently inferred from the same data, risking post hoc decomposition without a design that independently 
isolates the two mechanisms against a true neutral baseline.","dimension":"overclaiming","abstractBasis":"the results suggested that both inhibitory and facilitatory mechanisms may have contributed to the observed affective responses","confidence":"medium"},{"id":"c7","summary":"Generalisation from holding a pen in the mouth while rating cartoons to the broad hypothesis that 'people's facial activity influences their affective responses' extrapolates from a narrow, artificial smile-muscle manipulation and a single emotion (humor) to facial feedback generally.","dimension":"generalisation","abstractBasis":"people's facial activity influences their affective responses","confidence":"medium"},{"id":"c8","summary":"The abstract claims to 'eliminate methodological problems of earlier experiments' but does not specify which confounds were removed or provide evidence that the new pen procedure introduces no new confounds (e.g., effort, distraction, or comfort differences between the two holding positions).","dimension":"methods","abstractBasis":"designed to both eliminate methodological problems of earlier experiments and clarify theoretical ambiguities","confidence":"medium"}],"expertFlaws":[{"id":"f1","claim":"The headline effect (induced smiling increases rated funniness) does not replicate: a pooled estimate across 17 high-powered direct replications is 0.03 rating units (95% CI -0.11 to 0.16), versus the original 0.82, indicating the original effect is essentially absent or massively overstated.","detectability":"fulltext_or_external","justification":"The non-replication itself is established only by running 17 new independent samples and pooling them. 
No reader of the abstract alone can know that the effect vanishes under direct replication; this is the definitional external-replication finding and cannot be anticipated from the abstract's text."},{"id":"f2","claim":"The original two-study finding rests on a small single-source sample, making the reported effect vulnerable to sampling variability and false-positive inflation typical of pre-power-revolution social psychology.","detectability":"abstract_detectable","justification":"The abstract presents only two studies from a single research group with no sample sizes or power justification, and frames a fragile manipulation as decisive. A careful reader can legitimately flag the small-N, single-lab, original-only basis as a robustness risk warranting independent replication, which is precisely the antecedent the RRR targeted."},{"id":"f3","claim":"The pen-in-mouth manipulation is a subtle, demand-prone, and fragile procedure whose effect on self-reported humor may not survive procedural variation (e.g., presence/absence of cameras, instructions, stimuli) across labs.","detectability":"abstract_detectable","justification":"The abstract itself describes the manipulation (holding a pen in the mouth to inhibit/facilitate smiling muscles) and the outcome (self-reported humor intensity). A reader can anticipate that such an indirect, self-report-based effect is sensitive to procedural and demand factors and may not generalize, which is the robustness concern the multi-lab RRR tested."},{"id":"f4","claim":"The abstract over-interprets the data by drawing fine-grained theoretical conclusions (facial feedback affects the affective but not the cognitive component; both inhibitory and facilitatory mechanisms contribute) that depend on an effect that itself does not reliably exist.","detectability":"abstract_detectable","justification":"The abstract layers strong mechanistic dissociation claims on top of the basic effect. 
A careful reader can note that these secondary theoretical claims are only meaningful if the primary effect is real and robust; their confident assertion from two small studies is an over-generalization visible in the abstract's own wording."},{"id":"f5","claim":"The reported original effect size (0.82) is implausibly large relative to what the construct can plausibly produce, consistent with publication-era effect-size inflation rather than a stable phenomenon.","detectability":"fulltext_or_external","justification":"The specific magnitude 0.82 and its implausibility are established only by comparison with the pooled replication estimate; the abstract gives no effect size, so the inflation cannot be quantified or detected from the abstract alone."}],"strict":{"judgements":[{"flawId":"f2","matched":true,"matchedConcernId":"c1","evidence":"f2 concerns the small single-source sample making the effect vulnerable to sampling variability and false-positive inflation, warranting independent replication. c1 directly flags that the headline causal claim rests on a between-conditions difference 'with no effect size, sample size, or significance information disclosed,' identifying the same robustness/false-positive risk from undisclosed N and power. c4 (internal-only replication does not establish independent reproducibility) reinforces the same antecedent but c1 most squarely names the small-N/false-positive concern. The match is substantive: both target the fragility of the effect given an undisclosed small sample from a single lab."},{"flawId":"f3","matched":true,"matchedConcernId":"c2","evidence":"f3 is that the pen-in-mouth manipulation is subtle, demand-prone, and fragile, with a self-report outcome that may not survive procedural variation across labs. c2 substantively identifies the demand-sensitivity of the self-report outcome: 'the affective effect is measured by demand-sensitive ratings rather than any independent behavioral or physiological index.' 
This names the same core methodological problem (an indirect, self-report-based, demand-prone effect). c3 (no manipulation/suspicion check evidence) also supports the fragility/demand theme, but c2 most directly matches the demand-prone self-report concern central to f3."},{"flawId":"f4","matched":true,"matchedConcernId":"c5","evidence":"f4 is over-interpretation: fine-grained theoretical claims (affective-but-not-cognitive dissociation; both inhibitory and facilitatory mechanisms) layered on an effect that may not be real. c5 directly targets one such claim, noting the affective-not-cognitive dissociation is 'a claim of a null effect on cognition ... presented as a positive finding without indicating it was powered to detect such an absence.' c6 targets the other secondary claim (inhibitory/facilitatory mechanisms inferred post hoc without a neutral baseline). Together these substantively match f4's over-interpretation point; c5 is the closest single match on the dissociation claim depending on a real primary effect."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f2","matched":false,"matchedConcernId":"c4","evidence":"f2 concerns the original two-study finding resting on a small single-source sample, vulnerable to sampling variability and false-positive inflation, warranting independent replication. The closest blind concerns are c1 (no effect size, sample size, or significance information disclosed) and c4 (Study 2 replication is internal/same-authors, not independent reproducibility). c1 flags the absence of sample-size/power disclosure but frames it as a reporting gap about the headline causal claim, not as a robustness/false-positive-inflation risk from a small single-lab sample. c4 does capture the 'same authors, internal replication does not establish independent reproducibility' angle, which overlaps with the 'single-source/single-lab' and 'warrants independent replication' core of f2. This is the strongest candidate. 
c4 substantively identifies that the original studies are single-source and that genuine independent reproduction is lacking, which is the antecedent the RRR targeted. I credit c4 as a charitable-lens match: it names the single-lab, internal-replication fragility that is precisely f2's concern about a small single-source basis needing independent replication."},{"flawId":"f3","matched":true,"matchedConcernId":"c2","evidence":"f3: the pen-in-mouth manipulation is subtle, demand-prone, and fragile, and its effect on self-reported humor may not survive procedural variation across labs. Multiple blind concerns converge on this: c2 explicitly flags the sole outcome as 'demand-sensitive ratings' rather than an independent index of emotion; c3 questions whether the 'nonobtrusive' manipulation actually concealed its purpose (a manipulation/demand-awareness concern); c8 flags that the new pen procedure may introduce new confounds (effort, distraction, comfort) between holding positions. Together these capture the subtlety/demand-proneness/fragility of the manipulation. c2 most directly matches the demand-sensitivity and self-report fragility at the heart of f3 — that the effect rests on demand-sensitive self-report of humor. I credit the match (using c2 as the primary anchor) because it substantively identifies the demand-prone, self-report-based fragility of the manipulation's effect."},{"flawId":"f4","matched":true,"matchedConcernId":"c5","evidence":"f4: the abstract over-interprets by drawing fine-grained theoretical conclusions (affective-but-not-cognitive dissociation; both inhibitory and facilitatory mechanisms contribute) that depend on an effect that may not reliably exist. c5 directly targets the dissociation claim ('facial feedback operates on the affective but not on the cognitive component') as an unsupported null-on-cognition claim presented as a positive finding without power justification. 
c6 targets the 'both inhibitory and facilitatory mechanisms may have contributed' claim as hedged, post hoc decomposition without a design isolating the two mechanisms. Both c5 and c6 substantively identify the same secondary-theoretical-overclaiming that f4 describes — layering strong mechanistic claims on top of a fragile basic effect. This is a clear substantive match."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f2"],"voidedMatches":[],"detectabilityDisputes":[],"rejectedFlags":[]}},{"slug":"colonial-origins","targetTitle":"The Colonial Origins of Comparative Development: An Empirical Investigation","targetDoi":"10.1257/aer.91.5.1369","openalexId":"https://openalex.org/W3124166904","aiRelated":false,"field":"Economics","expertSource":"Albouy (2012), American Economic Review, Comment","expertCritique":"Albouy shows that 36 of the 64 countries are assigned settler-mortality rates borrowed from other countries and that incomparable rates from laborers, bishops, and soldiers on campaign are combined in ways favorable to the institutions hypothesis. 
Once data problems are addressed, the mortality-expropriation relationship and the IV estimates lose robustness, often yielding effectively infinite confidence intervals.","blindConcerns":[{"id":"c1","summary":"The exclusion restriction for the instrument is asserted but not defended: European settler mortality is assumed to affect modern income only through institutions, yet historical mortality could plausibly proxy for present-day disease environment, geography, or human-capital channels that bear directly on income.","dimension":"methods","abstractBasis":"Exploiting differences in European mortality rates as an instrument for current institutions","confidence":"high"},{"id":"c2","summary":"Income gains are described as 'large' with no point estimate, confidence interval, or sample size in the abstract, so the magnitude and precision of the headline effect cannot be assessed.","dimension":"statistics","abstractBasis":"we estimate large effects of institutions on income per capita","confidence":"high"},{"id":"c3","summary":"'Institutions' are invoked as the causal mediator but never operationalized in the abstract, leaving unclear what is measured (property rights, rule of law, expropriation risk) and whether a single index can carry such a sweeping causal claim.","dimension":"measurement","abstractBasis":"estimate the effect of institutions on economic performance","confidence":"high"},{"id":"c4","summary":"The claim that early colonial institutions 'persisted to the present' is stated as fact rather than demonstrated, papering over centuries of intervening history (independence, reform, conflict) through which the proposed causal chain must survive.","dimension":"claims","abstractBasis":"These institutions persisted to the present.","confidence":"high"},{"id":"c5","summary":"The conclusion that Africa and equatorial countries lack lower incomes 'once institutions are controlled for' risks overclaiming, since it treats geography purely as a confounder to be 
partialled out while geography may operate partly through the same channel as the instrument and the mediator.","dimension":"overclaiming","abstractBasis":"countries in Africa or those closer to the equator do not have lower incomes","confidence":"medium"},{"id":"c6","summary":"Reducing 'different colonization policies' and 'very different institutions' across heterogeneous colonies to a binary settle/extract mechanism may oversimplify a wide range of historical experiences into a single mortality-driven story.","dimension":"generalisation","abstractBasis":"Europeans adopted very different colonization policies in different colonies, with different associated institutions","confidence":"medium"},{"id":"c7","summary":"The abstract gives no information about the provenance, comparability, or quality of the historical European mortality data, which is the linchpin of the entire identification strategy.","dimension":"data_code","abstractBasis":"We exploit differences in European mortality rates","confidence":"medium"},{"id":"c8","summary":"The sample is implicitly restricted to former European colonies, so the estimated institution-income relationship may not generalize beyond that selected set of countries.","dimension":"generalisation","abstractBasis":"Europeans adopted very different colonization policies in different colonies","confidence":"medium"}],"expertFlaws":[{"id":"f1","claim":"36 of the 64 settler-mortality observations are not actual mortality rates measured in the country itself, but rates imputed/borrowed from other countries, making the key instrument's data partly fabricated by assignment.","detectability":"fulltext_or_external","justification":"The abstract presents 'differences in European mortality rates' as a clean measured instrument and gives no hint about coverage, sourcing, or imputation of the rates. 
Discovering that more than half the rates are borrowed from other countries requires inspecting the underlying data appendix and original mortality sources, not the abstract."},{"id":"f2","claim":"The mortality series combines incomparable populations (laborers, bishops, soldiers on campaign) whose death rates are not on the same scale, and the combinations were made in directions favorable to the institutions hypothesis.","detectability":"fulltext_or_external","justification":"The abstract refers only to 'European mortality rates' as a single homogeneous instrument and says nothing about who the rates were measured on or how heterogeneous sources were spliced. Detecting that incomparable groups were combined, and combined in a results-favorable way, requires auditing the construction of the mortality variable in the full paper and primary sources."},{"id":"f3","claim":"Once the data problems are corrected, the first-stage mortality-expropriation relationship is no longer robust, undermining instrument relevance.","detectability":"fulltext_or_external","justification":"The strength of the first stage is an empirical property only revealed by re-running the regressions with corrected data; the abstract reports only that mortality was used 'as an instrument' and asserts large effects, with no first-stage diagnostics a reader could interrogate."},{"id":"f4","claim":"After fixing the data, the IV estimates lose robustness, often producing effectively infinite confidence intervals — i.e. the headline 'large effects of institutions' is not statistically identified.","detectability":"fulltext_or_external","justification":"This is a reanalysis result obtained by recoding the data and recomputing the IV; nothing in the abstract's wording exposes the fragility. 
The collapse to near-infinite CIs can only be demonstrated externally, not anticipated from the abstract's confident claim of 'large effects.'"},{"id":"f5","claim":"The identification rests entirely on a single instrument (settler mortality) whose validity and measurement are assumed rather than independently verified, creating a single point of failure for all downstream claims.","detectability":"abstract_detectable","justification":"The abstract itself states the entire causal estimate is driven by 'exploiting differences in European mortality rates as an instrument.' A careful reader can flag that a strong, sweeping causal conclusion hinging on one historical instrument is only as good as that instrument's measurement and exclusion validity — a legitimate robustness concern, even though the specific data errors are not visible."},{"id":"f6","claim":"The abstract makes strong, far-reaching causal claims (large effects of institutions; geography/Africa dummies become insignificant once institutions are controlled) from a cross-country observational design, which is vulnerable to fragile identification.","detectability":"abstract_detectable","justification":"The abstract's own language — 'large effects,' and the headline that Africa and equatorial location no longer predict lower income once institutions are controlled — is a strong criterion-dependent over-generalisation from a 64-country observational IV. A reader can legitimately anticipate that such sweeping conclusions may be sensitive to data and specification choices, even without seeing the specific coding errors."}],"strict":{"judgements":[{"flawId":"f5","matched":true,"matchedConcernId":"c1","evidence":"Flaw f5 states the identification 'rests entirely on a single instrument (settler mortality) whose validity and measurement are assumed rather than independently verified.' This has two prongs: exclusion/validity and measurement. 
Concern c1 directly addresses the validity prong: 'The exclusion restriction for the instrument is asserted but not defended... historical mortality could plausibly proxy for present-day disease environment, geography, or human-capital channels.' This names the same specific methodological problem (instrument validity assumed, not verified). Concern c7 separately covers the measurement prong ('no information about the provenance, comparability, or quality of the historical European mortality data, which is the linchpin'). Either substantively matches; c1 best captures the core single-point-of-failure validity concern."},{"flawId":"f6","matched":true,"matchedConcernId":"c5","evidence":"Flaw f6 targets 'strong, far-reaching causal claims (large effects of institutions; geography/Africa dummies become insignificant once institutions are controlled) from a cross-country observational design... vulnerable to fragile identification.' Concern c5 directly engages the same headline claim: 'the conclusion that Africa and equatorial countries lack lower incomes once institutions are controlled for risks overclaiming, since it treats geography purely as a confounder to be partialled out.' This is the same specific over-generalisation the expert flags (the Africa/geography-becomes-insignificant claim being an overreach from the design). The match is on the same named claim and the same fragility/overclaiming problem, not merely topical."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f5","matched":true,"matchedConcernId":"c1","evidence":"Flaw f5 is that identification rests entirely on a single instrument (settler mortality) whose validity AND measurement are assumed rather than verified, creating a single point of failure. Concern c1 directly targets the instrument's exclusion-restriction validity: 'The exclusion restriction for the instrument is asserted but not defended... 
historical mortality could plausibly proxy for present-day disease environment, geography, or human-capital channels that bear directly on income.' This is the validity half of f5. Concern c7 separately flags the measurement/data-quality half ('no information about the provenance, comparability, or quality of the historical European mortality data, which is the linchpin of the entire identification strategy'). c1 alone substantively identifies the same underlying weakness — the unverified validity of the sole instrument that drives all downstream claims — so the match holds (c7 reinforces it on the measurement axis)."},{"flawId":"f6","matched":true,"matchedConcernId":"c5","evidence":"Flaw f6 is the strong, far-reaching causal claim from a cross-country observational design — specifically 'large effects' and the headline that Africa/equatorial dummies become insignificant once institutions are controlled, which may be fragile/sensitive to specification. Concern c5 targets exactly this headline: 'The conclusion that Africa and equatorial countries lack lower incomes once institutions are controlled for risks overclaiming, since it treats geography purely as a confounder to be partialled out.' This is the same over-generalisation f6 names. It is reinforced by c2 (the 'large effects' are stated with no point estimate/CI/sample size, so magnitude and precision cannot be assessed) and c8 (sample restricted to former colonies, limiting generalizability). c5 substantively captures the criterion-dependent over-claiming f6 describes, so matched=true."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":[],"voidedMatches":[],"detectabilityDisputes":[{"flawId":"f6","assigned":"abstract_detectable","auditor":"abstract_detectable","disposition":"overturn_overridden","note":"The adversarial auditor overturned the f6<-c5 match as topical; adjudication OVERRODE that overturn. 
c5 is a precise exclusion-restriction critique (geography may affect income through channels other than institutions, threatening the instrument) — the core identification fragility f6 names, a substantive match the auditor judged too harshly."}],"rejectedFlags":[]}},{"slug":"abortion-crime","targetTitle":"The Impact of Legalized Abortion on Crime","targetDoi":"10.1162/00335530151144050","openalexId":"https://openalex.org/W2889639449","aiRelated":false,"field":"Economics","expertSource":"Foote & Goetz (2008), Quarterly Journal of Economics, Comment","expertCritique":"The comment identifies a coding mistake in the within-state cohort regressions and shows that correcting it and using a per-capita crime specification sharply weakens the results. It also shows the cross-state tests are not robust to allowing differential state trends.","blindConcerns":[{"id":"c1","summary":"The headline causal claim that legalized abortion 'contributed significantly to' and 'account[s] for as much as 50 percent of' the crime drop is far stronger than the correlational, observational evidence described, which cannot rule out confounders that also differed between high- and low-abortion states.","dimension":"claims","abstractBasis":"Legalized abortion appears to account for as much as 50 percent of the recent drop in crime.","confidence":"high"},{"id":"c2","summary":"The 50 percent figure is presented as an upper bound ('as much as') yet functions as the abstract's takeaway, an overclaiming pattern where the most striking number is foregrounded without the accompanying lower bound, confidence interval, or uncertainty.","dimension":"overclaiming","abstractBasis":"as much as 50 percent of the recent drop in crime","confidence":"high"},{"id":"c3","summary":"The 'roughly eighteen years' timing argument is a post-hoc lag chosen to align legalization with the crime decline; the abstract gives no evidence the lag was pre-specified rather than fitted to the data, leaving the temporal 
coincidence open to alternative explanations.","dimension":"methods","abstractBasis":"Crime began to fall roughly eighteen years after abortion legalization.","confidence":"medium"},{"id":"c4","summary":"The five-state early-legalization comparison rests on a very small treated group whose crime trends could diverge for many reasons unrelated to abortion; the abstract reports the directional difference without any control for the states' other characteristics.","dimension":"statistics","abstractBasis":"The five states that allowed abortion in 1970 experienced declines earlier than the rest of the nation","confidence":"medium"},{"id":"c5","summary":"The cross-state correlation between 1970s-80s abortion rates and 1990s crime reductions is vulnerable to omitted-variable bias, since states with high abortion rates may differ systematically (e.g., urbanization, policing, drug markets) in ways that independently drive crime trends.","dimension":"methods","abstractBasis":"States with high abortion rates in the 1970s and 1980s experienced greater crime reductions in the 1990s.","confidence":"high"},{"id":"c6","summary":"The cohort test treats 'born after abortion legalization' arrest declines as evidence of mechanism, but the abstract does not establish that pre- and post-legalization cohorts in high-abortion states are otherwise comparable, so the differential could reflect age-period-cohort dynamics rather than abortion.","dimension":"measurement","abstractBasis":"only arrests of those born after abortion legalization fall relative to low abortion states","confidence":"medium"},{"id":"c7","summary":"Abortion rates are a measured proxy whose accuracy in the 1970s-80s is uncertain (legal vs. 
actual, reporting differences across states), yet the abstract treats high/low abortion status as a clean classifier without acknowledging measurement error.","dimension":"measurement","abstractBasis":"States with high abortion rates in the 1970s and 1980s","confidence":"medium"},{"id":"c8","summary":"The abstract reports no statistics, sample sizes, model specifications, or robustness checks, so the strength and precision of the central estimate cannot be assessed from the abstract alone.","dimension":"reproducibility","abstractBasis":"We offer evidence that legalized abortion has contributed significantly to recent crime reductions.","confidence":"medium"},{"id":"c9","summary":"Multiple distinct empirical strategies (timing, five-state, cross-state rates, cohort arrests) are presented as mutually reinforcing, but the abstract does not indicate whether they share common confounders; convergence of correlations that all suffer the same omitted variables would not constitute independent corroboration.","dimension":"claims","abstractBasis":"States with high abortion rates in the 1970s and 1980s experienced greater crime reductions in the 1990s. In high abortion states, only arrests of those born after abortion legalization fall","confidence":"medium"}],"expertFlaws":[{"id":"f1","claim":"The headline within-state cohort result ('only arrests of those born after abortion legalization fall relative to low abortion states') rests on a regression that was intended to include state-year interaction terms but, due to a coding mistake, omitted them; correcting this guts the effect.","detectability":"fulltext_or_external","justification":"The coding error is a hidden defect in the regression code/specification. The abstract reports only the substantive finding and says nothing about the included fixed effects or interaction terms, so the error could only be found by inspecting the full paper's specification and replicating the regression. 
Not anticipatable from the abstract's wording alone."},{"id":"f2","claim":"The within-state results depend on using arrest counts rather than a per-capita (arrests-divided-by-population) specification; moving to per-capita crime rates sharply weakens the estimated abortion effect because cohort size itself drives arrest counts.","detectability":"fulltext_or_external","justification":"Whether the dependent variable is a count or a per-capita rate, and the sensitivity to that choice, is a specification detail invisible in the abstract. It can only be uncovered by reading the methods and re-running the model on a normalized outcome. The abstract gives no hint of the levels-vs-rate modeling choice."},{"id":"f3","claim":"The cross-state test (high-abortion states in the 1970s-80s showing greater 1990s crime declines) is not robust to allowing differential state-specific time trends; once such trends are permitted, the abortion-crime relationship is not reliably identified.","detectability":"abstract_detectable","justification":"The abstract itself frames this as a cross-state correlation between an early-period abortion rate and a later-period crime change, with no mention of controlling for pre-existing state trends. A careful reader can legitimately flag that such a panel correlation is vulnerable to omitted differential state trends -- the methodological antecedent (no trend controls visible) is exposed by the abstract's own description, even though the empirical demonstration required reanalysis."},{"id":"f4","claim":"The attribution of 'as much as 50 percent' of the crime drop to legalized abortion is an overstated magnitude claim that collapses once the coding error is fixed and per-capita / trend-robust specifications are used.","detectability":"abstract_detectable","justification":"The abstract makes a very large, precise causal-magnitude claim ('as much as 50 percent') from observational state-panel data. 
A careful reader can flag the magnitude as a fragile, specification-dependent headline that warrants robustness scrutiny -- the over-reach is visible in the abstract's own wording, though the specific arithmetic of its collapse needed the reanalysis."},{"id":"f5","claim":"The overall identification leans on observational timing coincidences (crime falling 'roughly eighteen years after' legalization; five early-legalizing states declining earlier) that are confounded by many other contemporaneous determinants of 1990s crime trends.","detectability":"abstract_detectable","justification":"The abstract explicitly grounds the causal claim in timing and a handful of early-adopting states, which is a classic observational identification-from-timing design. A reader can legitimately anticipate confounding and the small number of treated 'early' units (five states) as a robustness/generalization risk directly from the abstract's wording."}],"strict":{"judgements":[{"flawId":"f3","matched":true,"matchedConcernId":"c5","evidence":"c5 states the cross-state correlation 'is vulnerable to omitted-variable bias, since states with high abortion rates may differ systematically (e.g., urbanization, policing, drug markets) in ways that independently drive crime trends.' This identifies the same core methodological problem as f3: the cross-state panel correlation between early-period abortion rates and later crime declines is not identified because states differ systematically in ways that drive crime trends. f3 frames this specifically as differential state-specific time trends; c5 names unobserved state heterogeneity driving crime trends in exactly this design. 
The underlying defect (no control for systematic state differences in crime trajectories) is the same, so this is a substantive, not merely topical, match."},{"flawId":"f4","matched":true,"matchedConcernId":"c2","evidence":"c2 flags that 'The 50 percent figure is presented as an upper bound (as much as) yet functions as the abstract's takeaway, an overclaiming pattern where the most striking number is foregrounded without the accompanying lower bound, confidence interval, or uncertainty.' f4 claims the 'as much as 50 percent' is an overstated, fragile, specification-dependent magnitude claim. Both identify the same problem: the large magnitude claim is over-reaching and not backed by demonstrated robustness/uncertainty from observational data. This is a substantive match on the overstated-magnitude defect."},{"flawId":"f5","matched":true,"matchedConcernId":"c4","evidence":"f5 names two prongs: confounded timing coincidence and the small number (five) of early-legalizing treated states. c4 directly identifies the five-state issue: 'The five-state early-legalization comparison rests on a very small treated group whose crime trends could diverge for many reasons unrelated to abortion... without any control for the states' other characteristics.' This matches f5's small-treated-units and confounding concern. c3 additionally matches the timing prong (post-hoc eighteen-year lag, temporal coincidence open to alternative explanations). Together c3/c4 substantively capture f5's observational-timing-identification-with-confounding defect; c4 is the primary match."}],"overclaimedConcerns":[]},"charitable":{"judgements":[{"flawId":"f3","matched":true,"matchedConcernId":"c5","evidence":"c5 states the cross-state correlation 'is vulnerable to omitted-variable bias, since states with high abortion rates may differ systematically' and independently drive crime trends. f3 is specifically about the cross-state test not being robust to differential state-specific time trends. 
c5 directly targets the same cross-state correlation (states with high abortion rates in 1970s-80s experiencing greater 1990s declines) and flags that systematic state-level differences could independently drive crime trends. Differential state trends are a leading instance of exactly this omitted-variable-across-states concern, so this substantively identifies the same underlying methodological weakness in the same test."},{"flawId":"f4","matched":true,"matchedConcernId":"c2","evidence":"c2 flags the 'as much as 50 percent' figure as an overclaiming pattern where 'the most striking number is foregrounded without the accompanying lower bound, confidence interval, or uncertainty,' and f4's justification notes the over-reach is visible in the abstract's wording. f4 says the '50 percent' magnitude is an overstated, fragile, specification-dependent headline. c2 substantively identifies the same flaw: an overstated, unqualified large magnitude claim that warrants robustness/uncertainty scrutiny. (c1 also touches this but emphasizes causal-vs-correlational; c2 is the closest substantive match to the magnitude-overstatement flaw.)"},{"flawId":"f5","matched":true,"matchedConcernId":"c5","evidence":"f5 concerns identification leaning on timing coincidences (crime falling ~18 years after legalization; five early-legalizing states) confounded by other contemporaneous determinants. Multiple blind concerns target exactly these: c3 flags the 'roughly eighteen years' lag as post-hoc and 'open to alternative explanations'; c4 flags the five-state comparison as 'a very small treated group whose crime trends could diverge for many reasons unrelated to abortion'; c5 raises omitted-variable confounders. Together these substantively identify the same observational-identification-from-timing weakness with confounding and the small five-state treated group that f5 describes. 
The strongest single match is c4, which directly names both the small treated group and confounding."}],"overclaimedConcerns":[]},"audit":{"overturnedFlawIds":["f3"],"voidedMatches":[],"detectabilityDisputes":[],"rejectedFlags":[]}}]}