Benchmarks · the calibration corpus

Published critiques, as the standard to meet

Name: Critical AI — published-critique benchmark corpus
Creator: Critical AI

Critique is not a novelty Critical AI invented. The strongest journals in every social science publish formal Comments, author Replies, replication studies and reanalyses — adversarial scholarship that contests a specific claim in print. This page collects real exemplars of that genre as a calibration corpus: the bar an AI-native critique should clear. Each one’s DOI is independently Crossref-verified, because a benchmark built on a fabricated citation would be worse than none.
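The verification notes on the entries below describe a DOI resolving in Crossref to the same title, venue, year and record type. A minimal sketch of that comparison might look like the following. It assumes the Crossref record has already been fetched and is passed in as a plain dict using Crossref's `title`, `container-title`, `issued` and `type` fields. The `entry` keys (`title`, `venue`, `year`) and both function names are hypothetical, chosen for illustration, and this is not presented as the corpus's actual pipeline.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, drop accents, and collapse punctuation/whitespace to single spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).strip()


def check_against_crossref(entry: dict, message: dict) -> list[str]:
    """Return the names of fields where the corpus entry and the Crossref record disagree.

    `entry` is a hypothetical corpus row with keys title, venue, year.
    `message` is the `message` object of a Crossref works record.
    """
    problems = []

    titles = message.get("title") or [""]
    if normalize_title(titles[0]) != normalize_title(entry["title"]):
        problems.append("title")

    venues = message.get("container-title") or [""]
    if normalize_title(venues[0]) != normalize_title(entry["venue"]):
        problems.append("venue")

    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    if date_parts[0][0] != entry["year"]:
        problems.append("year")

    if message.get("type") != "journal-article":
        problems.append("type")

    return problems


# Hand-written stand-in for a Crossref record (not a real API response).
entry = {
    "title": "The Impact of Legalized Abortion on Crime: Comment",
    "venue": "Quarterly Journal of Economics",
    "year": 2008,
}
record = {
    "title": ["The Impact of Legalized Abortion on Crime: Comment"],
    "container-title": ["Quarterly Journal of Economics"],
    "issued": {"date-parts": [[2008, 2]]},
    "type": "journal-article",
}
print(check_against_crossref(entry, record))
```

An empty list means every checked field agrees. Normalizing before comparing keeps differences in quotation marks, accents or spacing from being reported as title mismatches, though a stricter check could compare the raw strings as well.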

68 verified critiques · 21 on AI/algorithmic targets · across 48 venues · 61 fields · 10 critique dimensions exercised

Comment (11)

A formal Comment published in the same journal, contesting a specific claim, method, or result of the target paper.

Tier S · Applied Microeconomics / Crime · 2008
The Impact of Legalized Abortion on Crime: Comment
Christopher L. Foote, Christopher F. Goetz · Quarterly Journal of Economics · 2008
The comment identifies a coding mistake in the within-state cohort regressions of Donohue and Levitt's abortion-crime paper and shows that correcting it and using a per-capita crime specification sharply weakens the results. It also shows the cross-state tests are not robust to allowing differential state trends.
Critiques: The Impact of Legalized Abortion on Crime — John J. Donohue III, Steven D. Levitt
methods · identification · data_code · statistics · claims · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in Quarterly Journal of Economics (2008), indexed as journal-article.
Tier S · Development Economics / Institutions · 2012
The Colonial Origins of Comparative Development: An Empirical Investigation: Comment
David Y. Albouy · American Economic Review · 2012
Albouy shows that 36 of the 64 countries are assigned settler-mortality rates borrowed from other countries and that incomparable rates from laborers, bishops, and soldiers on campaign are combined in ways favorable to the institutions hypothesis. Once data problems are addressed, the mortality-expropriation relationship and the instrumental-variable estimates lose robustness, often yielding effectively infinite confidence intervals.
Critiques: The Colonial Origins of Comparative Development: An Empirical Investigation — Daron Acemoglu, Simon Johnson, James A. Robinson
data_code · identification · methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2012), indexed as journal-article.
Tier S · Psychology / metascience · 2016
Comment on "Estimating the reproducibility of psychological science"
Daniel T. Gilbert, Gary King, Stephen Pettigrew, Timothy D. Wilson · Science · 2016
Argues the Open Science Collaboration's Reproducibility Project contains three statistical errors (low-power replications, non-representative study sampling, and misleading endorsement criteria) that bias the reported replication rate downward. Concludes the data are actually consistent with very high reproducibility, not the low rate the original claimed.
Critiques: Estimating the reproducibility of psychological science — Open Science Collaboration
statistics · methods · identification · claims · reproducibility · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Science (2016), indexed as journal-article.
Tier S · Strategic Management / Organization · 2016
Reaffirming the CEO Effect Is Significant and Much Larger than Chance: A Comment on Fitza (2014)
Timothy J. Quigley, Scott D. Graffin · Strategic Management Journal · 2016
Challenges Fitza's (2014) claim that the estimated 'CEO effect' on firm performance is almost entirely an artifact of random chance, arguing his simulation/variance-decomposition approach mis-specifies the chance baseline. Using corrected methods, they conclude the CEO effect is statistically significant and substantively much larger than chance.
Critiques: The use of variance decomposition in the investigation of CEO effects: How large must the CEO effect be to rule out chance? — Markus A. Fitza
methods · identification · statistics · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.
Tier A · Sociology (immigration/assimilation) · 2011
Commentary: The Kids Are (Mostly) Alright: Second-Generation Assimilation: Comments on Haller, Portes and Lynch
Richard Alba, Philip Kasinitz, Mary C. Waters · Social Forces · 2011
Challenges Haller, Portes and Lynch's pessimistic 'segmented assimilation / downward assimilation' thesis about the immigrant second generation, arguing that their data, model specification and interpretation overstate the prevalence and inevitability of downward mobility, and that the bulk of second-generation outcomes are in fact reasonably positive.
Critiques: Dreams Fulfilled, Dreams Shattered: Determinants of Segmented Assimilation in the Second Generation
claims · theory · methods · generalisation · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Social Forces (2011), indexed as journal-article.
Tier A · Education / Special Education Policy · 2016
Risks and Consequences of Oversimplifying Educational Inequities: A Response to Morgan et al. (2015)
Russell J. Skiba, Alfredo J. Artiles, Elizabeth B. Kozleski, Daniel J. Losen et al. · Educational Researcher · 2016
Directly challenges Morgan et al.'s widely-cited claim that racial/ethnic minorities are underrepresented (not overrepresented) in special education, arguing the conclusion is in error due to sampling and model-specification choices, the heavy covariate adjustment that conditions away the inequities of interest, and failure to engage the broader complexity of disproportionality.
Critiques: Minorities Are Disproportionately Underrepresented in Special Education: Longitudinal Evidence Across Five Disability Conditions
methods · identification · statistics · claims · overclaiming · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2016), indexed as journal-article.
Tier S · Economics (education / labor economics) · 2017
Measuring the Impacts of Teachers: Comment
Jesse Rothstein · American Economic Review · 2017
Challenges Chetty, Friedman, and Rockoff's (2014) claim that the teacher-switching quasi-experiment shows student sorting creates negligible bias in teacher value-added (VA) scores. Rothstein shows teacher switching is correlated with changes in prior student preparedness, so the design is invalid; correcting for this reveals moderate VA bias (10-35% of teachers' causal effect variance) and shows long-run results are fragile to control choices.
Critiques: Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates — Raj Chetty, John N. Friedman, Jonah E. Rockoff
identification · methods · statistics · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2017), indexed as journal-article.
Tier S · Economics (labor economics) · 2000
Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Comment
David Neumark, William Wascher · American Economic Review · 2000
Re-examines Card and Krueger's (1994) finding that a New Jersey minimum-wage increase did not reduce (and may have raised) fast-food employment. Using administrative payroll records rather than the original telephone-survey data, Neumark and Wascher find that employment fell after the minimum-wage rise, contradicting the original positive/zero estimate and attributing the discrepancy to measurement error in the survey data.
Critiques: Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania — David Card, Alan B. Krueger
data_code · identification · methods · claims · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2000), indexed as journal-article.
Tier A · Quantitative criminology / public policy · 2001
Impossible Policy Evaluations and Impossible Conclusions: A Comment on Koper and Roth
Gary Kleck · Journal of Quantitative Criminology · 2001
Argues that Koper and Roth's evaluation of the 1994 federal assault weapons ban could not, by design, detect any plausible effect because assault weapons figure in so few homicides, so the data lack statistical power to support any conclusion. Contends their tentative inference that the ban may have reduced gun homicides is essentially impossible to sustain from the evidence presented.
Critiques: The Impact of the 1994 Federal Assault Weapon Ban on Gun Violence Outcomes: An Assessment of Multiple Outcome Measures and Some Lessons for Policy Evaluation — Christopher S. Koper, Jeffrey A. Roth
statistics · identification · methods · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Journal of Quantitative Criminology (2001), indexed as journal-article.
Tier S · Criminology · 2003
Long-Term Trends in Crimes of Violence (Comment on Cooney, 2003)
David F. Greenberg · Criminology · 2003
Challenges Cooney's (2003) historical thesis that violence has been 'privatized' (shifting from elite/collective to marginal/individual actors), arguing the long-term empirical evidence on violent-crime trends and the social characteristics of offenders does not support the claimed qualitative transformation. Questions the selectivity and interpretation of the historical and anthropological evidence Cooney marshals.
Critiques: The Privatization of Violence — Mark Cooney
theory · claims · methods · generalisation · data_code
Crossref-verified: DOI resolves in Crossref to this exact title in Criminology (2003), indexed as journal-article.
Tier B · Neuroimaging / psychiatry (fMRI methodology) · 2020
Variability may limit the translation of neuroimaging findings comment on “Variability in the analysis of a single neuroimaging dataset by many teams”
Rotem Botvinik-Nezer et al. (NARPS) · Journal of Affective Disorders · 2020
It comments that the analytic variability documented when many teams analyzed the same fMRI dataset undermines the persuasiveness of fMRI findings, and argues this reproducibility problem stems from the hypothesis-testing paradigm. It proposes that a machine-learning-based predictive-modeling approach could mitigate the issue and better detect subtle spatial brain patterns and individual-level effects.
Critiques: Botvinik-Nezer et al. (2020), "Variability in the analysis of a single neuroimaging dataset by many teams" (the NARPS study), and more broadly the hypothesis-testing paradigm in standard fMRI analysis.
reproducibility · methods · statistics
Crossref-verified: DOI resolves in Crossref to "Variability may limit the translation of neuroimaging findings comment on “Variability in the analysis of a single neuroimaging dataset by many teams”" in Journal of Affective Disorders (2020). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.

Reply / Rejoinder (5)

The original authors' Reply or Rejoinder — the other half of an adversarial exchange the journal published in full.

Tier A · Political communication / media exposure measurement · 2013
The Challenge of Measuring Media Exposure: Reply to Dilliplane, Goldman, and Mutz
Markus Prior · Political Communication · 2013
Prior critiques the program-list measure of televised political exposure that Dilliplane, Goldman, and Mutz proposed (and which the ANES adopted), arguing it has low construct validity because it never measures the amount of exposure and shows poor convergent validity by several criteria. He contends the measure conflates recall/recognition with exposure and overstates the predictive payoff of the new instrument.
Critiques: Televised Exposure to Politics: New Measures for a Fragmented Media Environment
methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Political Communication (2013), indexed as journal-article.
Tier A · Psychology (social/educational psychology) · 2004
Stereotype Threat Does Not Live by Steele and Aronson (1995) Alone
Claude M. Steele, Joshua Aronson · American Psychologist · 2004
It challenges Sackett et al.'s critique by arguing that their extremely narrow focus on the reporting of a single experiment from the first stereotype-threat article greatly exaggerates three issues, which the comment then addresses in turn. It defends the broader stereotype-threat literature against the claim that the original experiment has been pervasively mischaracterized.
Critiques: Sackett, Hardison & Cullen's critique of the stereotype-threat interpretation of the Black–White test-score gap
claims · overclaiming · generalisation
Crossref-verified: DOI resolves to this exact title/citation in American Psychologist 59(1):47–48 (2004); abstract captured from APA PsycNet. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre (original authors' reply/rejoinder to the Sackett et al. critique) confirmed.
Tier A · Epidemiology / development economics · 2015
Commentary: Deworming externalities and schooling impacts in Kenya: a comment on Aiken et al. (2015) and Davey et al. (2015)
Edward Miguel, Michael Kremer · International Journal of Epidemiology · 2015
This is the original authors' reply to two re-analyses of their 2004 deworming study; they acknowledge the re-analyses corrected some errors but argue the updated results are extremely similar to the originals, with externality and school-participation effects remaining significant, so their key conclusion (that individually randomized studies underestimate deworming impacts) still holds. Rather than challenging a prior finding, it defends the original study by interpreting the re-analysis as confirmatory.
Critiques: Aiken et al. (2015) and Davey et al. (2015) re-analysis of the Miguel & Kremer (2004) Kenya deworming study
statistics · reproducibility · claims · methods
Crossref-verified: DOI resolves to this exact title in International Journal of Epidemiology 44(5):1593 (2015); published extract captured from the publisher landing page (academic.oup.com). whatItChallenges grounded in the extract via the faithfulness ingestion gate (2026-06-22); genre (original authors' reply in a published reanalysis exchange) confirmed.
Tier A · Demography / development economics · 2006
On Explaining Asia's "Missing Women": Comment on Das Gupta
Emily Oster · Population and Development Review · 2006
Oster replies to Das Gupta's objection that sex-ratio patterns over time and across families (girls faring worse under resource constraints, later births skewing male after earlier girls) point to cultural/son-preference explanations rather than hepatitis B. Oster defends the hepatitis B mechanism alongside cultural factors, distinguishing the working-paper and published versions of her own analysis.
Critiques: Monica Das Gupta (2005) comment arguing the hepatitis-B explanation of Asia's 'missing women' is unlikely to be important
claims · identification · generalisation · theory
Crossref-verified: DOI 10.1111/j.1728-4457.2006.00120.x confirmed by the downloaded full-text PDF (Population and Development Review 32(2):323–327, 2006). whatItChallenges grounded in the article's opening (full text) via the faithfulness ingestion gate (2026-06-22); gate confirmed type=reply (Oster's reply to Das Gupta's comment, distinct from the critic's comment). Titled 'Comment on Das Gupta' but functionally Oster's reply defending her prior work. Source: user-supplied PDF download.
Tier S · Economics (labor economics) · 2000
Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Reply
David Card, Alan B. Krueger · American Economic Review · 2000
It responds to Neumark & Wascher's Comment, which attributed the contrary "employment decline" finding to flaws in Card & Krueger's telephone-survey employment data versus payroll records. Card & Krueger attempt to reconcile the two by analyzing administrative employment data from a new representative sample of NJ and PA fast-food employers, reanalyzing NW's data, and most importantly using BLS employer-reported ES-202 data to track a fixed longitudinal sample of major-chain establishments from 1992 to 1993.
Critiques: Neumark & Wascher (2000), Comment on Card & Krueger (1994) — using EPI/payroll-record data to argue the 1992 NJ minimum-wage increase reduced fast-food employment — David Neumark, William Wascher (2000)
data_code · methods · reproducibility · claims · statistics
Crossref-verified: DOI 10.1257/aer.90.5.1397 confirmed by the downloaded full-text PDF (AER 90(5):1397–1420, Dec 2000; JSTOR sici 0002-8282…1397). whatItChallenges grounded in the article's full text (abstract+intro) via the faithfulness ingestion gate (2026-06-22); genre confirmed = reply (Card & Krueger's reply to the Neumark & Wascher (2000) Comment on their 1994 AER study). Target DOI links the in-corpus Neumark-Wascher Comment (id neumark-minimum-wages-employment) — an explicit Comment↔Reply exchange. Source: user-supplied PDF download (replaces the earlier wrong-file njmin-aer.pdf, which was the 1994 original).

Rejoinder (3)

A rejoinder closing out a published exchange.

Tier S · Strategic Management / Organization · 2016
How Much Do CEOs Really Matter? Reaffirming That the CEO Effect Is Mostly Due to Chance
Markus A. Fitza · Strategic Management Journal · 2016
Rejoinder defending the original conclusion against Quigley and Graffin's comment, arguing that once more realistic assumptions about how chance affects firm performance are imposed, the apparent CEO effect is statistically indistinguishable from chance regardless of the estimation methodology used.
Critiques: Reaffirming the CEO Effect Is Significant and Much Larger than Chance: A Comment on Fitza (2014) (Quigley & Graffin 2017)
methods · identification · statistics · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.
Tier S · Political Science (experimental methods / voter mobilization) · 2005
Correction to Gerber and Green (2000), Replication of Disputed Findings, and Reply to Imai (2005)
Alan S. Gerber, Donald P. Green · American Political Science Review · 2005
Responds to Imai's (2005) reanalysis: acknowledges and repairs data-processing errors in the original 2000 article, then argues Imai's correction itself contains statistical, computational, and reporting errors that invalidate its conclusions. After fixes, the original substantive finding stands that brief phone calls do not meaningfully increase voter turnout.
Critiques: Do Get-Out-the-Vote Calls Reduce Turnout? The Importance of Statistical Methods for Field Experiments (Imai, 2005)
statistics · methods · data_code · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.
Tier A · Education / Mathematics Education Policy · 2008
Rejoinder to the Critiques of the National Mathematics Advisory Panel Final Report
Camilla Persson Benbow, Larry R. Faulkner · Educational Researcher · 2008
The Panel chair and co-chair rebut a cluster of published critiques (notably Boaler and Kelly) that attacked the Panel's restriction to randomized/quasi-experimental evidence on math curricula, instruction, and learning, defending the evidentiary standards and contesting claims that the report's methodology was inappropriate for educational field research.
Critiques: Critiques of the National Mathematics Advisory Panel Final Report (incl. Boaler, 'When Politics Took the Place of Inquiry,' and Kelly, 'Reflections on the National Mathematics Advisory Panel Final Report')
methods · identification · claims · generalisation · theory
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2008), indexed as journal-article.

Replication study (19)

An independent attempt to reproduce the target's result with its data and code, or to repeat it in a new sample.

Tier A · Macroeconomics / Public Finance · 2013
Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff
Thomas Herndon, Michael Ash, Robert Pollin · Cambridge Journal of Economics · 2013
Replicating Reinhart and Rogoff's claim that public debt above 90% of GDP is associated with sharply lower growth, the authors find a spreadsheet coding error, selective exclusion of available country-year data, and unconventional weighting. Corrected, average real growth for high-debt countries is +2.2%, not the published -0.1%, eliminating the supposed debt threshold.
Critiques: Growth in a Time of Debt — Carmen M. Reinhart, Kenneth S. Rogoff
methods · data_code · reproducibility · statistics · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Cambridge Journal of Economics (2013), indexed as journal-article.
Tier A · Social/cognitive psychology (self-control) · 2016
A Multilab Preregistered Replication of the Ego-Depletion Effect
Martin S. Hagger, Nikos L. D. Chatzisarantis, Hugo Alberts, Calvin Octavianus Anggono · Perspectives on Psychological Science · 2016
A preregistered Registered Replication Report across 23 labs (~2000 participants) tested the sequential-task ego-depletion effect and found a meta-analytic effect indistinguishable from zero (d ≈ 0.04). Challenges the existence and robustness of the widely cited ego-depletion / limited-resource model of self-control.
Critiques: Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers — Chandra Sripada, Daniel Kessler, John Jonides
reproducibility · methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.
Tier A · Social psychology / emotion · 2016
Registered Replication Report: Strack, Martin, & Stepper (1988)
E.-J. Wagenmakers, Titia Beek, Laura Dijkhoff, Quentin F. Gronau · Perspectives on Psychological Science · 2016
A Registered Replication Report of 17 direct replications of the classic pen-in-mouth facial-feedback study found a pooled effect of 0.03 rating units (95% CI -0.11 to 0.16) versus the original 0.82, failing to replicate the claim that induced smiling increases rated funniness. Challenges a textbook facial-feedback finding.
Critiques: Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis — Fritz Strack, Leonard L. Martin, Sabine Stepper
reproducibility · methods · statistics · claims · theory
Crossref-verified: DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.
Tier S · Social psychology / embodied cognition · 2015
Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women
Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg et al. · Psychological Science · 2015
A larger, better-powered replication (N=200) of the power-posing study replicated only self-reported feelings of power but found no effect of expansive postures on testosterone, cortisol, or behavioral risk tolerance. Challenges the central physiological and behavioral claims of the original power-posing paper.
Critiques: Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance — Dana R. Carney, Amy J. C. Cuddy, Andy J. Yap
reproducibility · methods · statistics · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Psychological Science (2015), indexed as journal-article.
Tier A · Psychology / metascience · 2018
Many Labs 2: Investigating Variation in Replicability Across Samples and Settings
Richard A. Klein, Michelangelo Vianello, Fred Hasselman · Advances in Methods and Practices in Psychological Science · 2018
A large preregistered multi-site project replicated 28 published effects across 60+ samples and ~15,000 participants; only about half replicated robustly and variation across samples/settings was generally small, implying non-replication reflects original effects rather than hidden moderators. Challenges the robustness and breadth of numerous canonical findings.
Critiques: 28 classic and contemporary psychological findings (multi-target replication, e.g. Tversky & Kahneman framing, Schwarz heuristics, moral-judgment effects) — Various original authors
reproducibility · methods · statistics · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Advances in Methods and Practices in Psychological Science (2018), indexed as journal-article.
Tier S · Strategic Management / Organization · 2022
The "CEO in Context" Technique Revisited: A Replication and Extension of Hambrick and Quigley (2014)
Tobias Keller, Martin Glaum, Andreas Bausch, Thorsten Bunz · Strategic Management Journal · 2022
Replicates and extends the 'CEO in Context' technique on a far larger sample (33,996 firm-years vs 4,866) and broadly confirms the original's high CEO effect, attributing about a third of the variance in firm performance (ROA) to the CEO. It also shows the estimate shrinks under an adjusted-R² specification, a within-paper robustness nuance rather than an overturning of the headline finding.
Critiques: Toward More Accurate Contextualization of the CEO Effect on Firm Performance (Hambrick & Quigley 2014)
methods · statistics · reproducibility · generalisation · identification
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2022), indexed as journal-article.
Tier S · Political Science (public opinion / political behavior) · 1998
Macropartisanship: A Replication and Critique
Donald Green, Bradley Palmquist, Eric Schickler · American Political Science Review · 1998
Replicates MacKuen, Erikson, and Stimson's claim that aggregate party identification swings substantially in response to short-term shocks like consumer sentiment and presidential approval. Using more extensive survey data and correcting for measurement error, finds the short-term partisan movement is two to three times smaller than originally reported, supporting a stable, slow-adjusting view of partisanship.
Critiques: Macropartisanship — Michael B. MacKuen, Robert S. Erikson, James A. Stimson
statistics · methods · reproducibility · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (1998), indexed as journal-article.
Tier A · Sociology (political sociology / welfare state) · 2015
The Missing Main Effect of Welfare State Regimes: A Replication of 'Social Policy Responsiveness in Developed Democracies' by Brooks and Manza
Nate Breznau · Sociological Science · 2015
Replicates Brooks and Manza's (2006, ASR) claim that public opinion drives welfare-state spending and finds it rests on a model specification error: they included an opinion-by-welfare-regime interaction while omitting the main effect of welfare regime; restoring the missing main effect across more than 800 model configurations eliminates the original finding in roughly 99.5% of cases.
Critiques: Social Policy Responsiveness in Developed Democracies
statistics · methods · reproducibility · data_code · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Sociological Science (2015), indexed as journal-article.
Tier S · Political science (political methodology / causal inference) · 2014
On the Validity of the Regression Discontinuity Design for Estimating Electoral Effects: New Evidence from Over 40,000 Close Races
Andrew C. Eggers, Anthony Fowler, Jens Hainmueller, Andrew B. Hall et al. · American Journal of Political Science · 2014
Assembling over 40,000 close races across many electoral settings, it finds no systematic evidence of strategic sorting or covariate imbalance at the threshold, arguing the close-election RD design is generally valid and that the Caughey-Sekhon/Snyder imbalance is largely specific to postwar U.S. House races rather than a general flaw. It reframes the earlier critique as an unusual case rather than evidence against RD broadly.
Critiques: Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008
identification · methods · statistics · reproducibility · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Political Science (2014), indexed as journal-article.
Tier S · Computational social science / social-media text analysis · 2021 · AI-related target
Reconsidering evidence of moral contagion in online social networks
Jason W. Burton, Nicole Cruz, Ulrike Hahn · Nature Human Behaviour · 2021
Re-tests Brady et al.'s (2017) 'moral contagion' method on six new Twitter corpora rather than reanalysing their data, and finds via out-of-sample prediction, model comparison and specification-curve analysis that the moral-contagion model performs no better than an implausibly-named 'XYZ contagion' placebo — challenging the strength of the original correlational claim while conceding moral contagion may still exist.
Critiques: Emotion shapes the diffusion of moralized content in social networks
methods · statistics · identification · reproducibility · overclaiming · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Nature Human Behaviour (2021), indexed as journal-article.
Tier A · New media / games studies / media effects · 2019
Game perspective-taking effects on willingness to help immigrants: A replication study with a Spanish sample
Jorge Peña, Juan Francisco Hernández Pérez · New Media & Society · 2019
A replication of a perspective-taking game study on willingness to help immigrants. The original reported reductions in behavioural intention, subjective norms and self-efficacy (attitudes were unaffected); the Spanish-sample replication reproduced the intention effect but not the subjective-norms or self-efficacy effects, while finding an attitude effect the original did not — partly corroborating and partly diverging from the original.
Critiques: Game Perspective-Taking Effects on Players' Behavioral Intention, Attitudes, Subjective Norms, and Self-Efficacy to Help Immigrants: The Case of 'Papers, Please'
reproducibility · generalisation · claims · methods
Crossref-verified: DOI resolves in Crossref to this exact title in New Media & Society (2019), indexed as journal-article.
Tier A · Public administration · 2021
A replication of "Representative bureaucracy and the willingness to coproduce"
Martin Sievert · Public Administration · 2021
A wide replication, on new data, of Riccucci, Van Ryzin and Li's (2016) survey experiment on representative bureaucracy and citizens' willingness to coproduce, testing whether the original's representation effects hold in a different national context rather than re-running the original data.
Critiques: Representative Bureaucracy and the Willingness to Coproduce: An Experimental Study — Norma M. Riccucci, Gregg G. Van Ryzin, Huafang Li
reproducibility · statistics · methods · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Public Administration (2021), indexed as journal-article.
Tier A · Political science · 2023 · AI-related target
Exposure to the Russian Internet Research Agency foreign influence campaign on Twitter in the 2016 US election and its relationship to attitudes and voting behavior
Gregory Eady, Tom Paskhalis, Jan Zilinsky, Richard Bonneau et al. · Nature Communications · 2023
Challenges the prevailing concern that the Russian Internet Research Agency's 2016 Twitter campaign meaningfully shaped US attitudes and votes. Using longitudinal survey data linked to respondents' Twitter feeds, it finds exposure was extremely concentrated (1% of users saw 70% of exposures), concentrated among strong Republicans, dwarfed by domestic news and politicians, and shows no evidence of a meaningful relationship to changes in attitudes, polarization, or voting behavior.
Critiques: The prevailing claim that the Russian Internet Research Agency's 2016 Twitter campaign had large effects on US attitudes and voting behaviour
claims · identification · overclaiming · data_code · generalisation
Crossref-verified: DOI resolves in Crossref to "Exposure to the Russian Internet Research Agency foreign influence campaign on Twitter in the 2016 US election and its relationship to attitudes and voting behavior" in Nature Communications (2023), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).
Tier A · Clinical medicine / medical informatics (sepsis prediction) · 2021 · AI-related target
External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients
Andrew Wong, Erkin Otles, John P. Donnelly, Andrew Krumm et al. · JAMA Internal Medicine · 2021
This study externally validates the proprietary Epic Sepsis Model on 38,455 hospitalizations and finds it has poor discrimination (AUROC 0.63) and calibration, missing 67% of sepsis patients while generating alerts for 18% of all hospitalizations (high alert-fatigue burden). It argues the model's widespread adoption despite this poor performance raises fundamental concerns about national sepsis management.
Critiques: The Epic Sepsis Model (ESM), a proprietary sepsis early-warning prediction algorithm implemented at hundreds of US hospitals.
statistics · claims · overclaiming · generalisation · methods
Crossref-verified: DOI resolves in Crossref to "External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients" in JAMA Internal Medicine (2021). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier A · Computational social science / criminal justice risk assessment · 2020 · AI-related target
The limits of human predictions of recidivism
Zhiyuan Lin, Jongbin Jung, Sharad Goel, Jennifer Skeem · Science Advances · 2020
The abstract reports a replication and extension of Dressel and Farid's study: under similar conditions it reproduces their finding that humans and algorithms perform comparably, but in three other datasets and in conditions without immediate feedback or with an enriched set of risk factors, algorithms outperformed humans. It concludes that algorithms can beat human recidivism predictions in ecologically valid settings, challenging the original's broader implication that risk tools add little value.
Critiques: Dressel and Farid's experiment finding that laypeople were as accurate as statistical algorithms in predicting recidivism (published as "The accuracy, fairness, and limits of predicting recidivism").
methods · generalisation · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to "The limits of human predictions of recidivism" in Science Advances (2020). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier S · Experimental economics · 2016
Evaluating replicability of laboratory experiments in economics
Colin F. Camerer, Anna Dreber, Eskil Forsell, et al. · Science · 2016
The authors directly replicated 18 economics laboratory studies from AER and QJE (2011-2014) using pre-registered analysis plans with at least 90% statistical power, finding a significant same-direction effect in only 11 of 18 (61%) replications, with replicated effect sizes averaging 66% of the originals. This empirically tests and partially undercuts the reliability of the original published findings.
Critiques: 18 laboratory experiments in economics published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014
reproducibility · statistics · methods · claims
Crossref-verified: DOI resolves in Crossref to "Evaluating replicability of laboratory experiments in economics" in Science (2016). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier B · Social psychology · 2019
Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement
Richard A. Klein, Corey L. Cook, Charles R. Ebersole, et al. · Collabra Psychology · 2019
A 17-lab, ~1,550-participant preregistered replication that failed to reproduce the classic Terror Management Theory mortality salience effect (Greenberg et al., 1994) under any condition, including with original-author involvement in study design. The authors conclude the original finding was either a false positive or that the conditions required to obtain it are not understood or no longer exist.
Critiques: The classic mortality salience / worldview-defense finding from Terror Management Theory, specifically Greenberg et al. (1994).
reproducibility · methods · statistics · claims
Crossref-verified: DOI resolves in Crossref to "Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement" in Collabra Psychology (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.
Tier A · Empirical asset pricing / finance · 2017
Replicating Anomalies
Kewei Hou, Chen Xue, Lu Zhang · Review of Financial Studies · 2017
Re-testing 452 published anomalies with microcaps controlled via NYSE breakpoints and value-weighted returns, the study finds 65% fail the single-test |t|>=1.96 hurdle (82% under a 2.78 multiple-testing hurdle), and that even surviving anomalies have much smaller economic magnitudes than originally reported. It concludes capital markets are more efficient than the prior literature recognized.
Critiques: The published cross-sectional stock-return anomaly literature — the body of 452 documented anomalies (including the trading-frictions category) and their originally reported return predictability.
methods · statistics · reproducibility · overclaiming · claims
Crossref-verified: DOI resolves in Crossref to "Replicating Anomalies" in Review of Financial Studies (2017). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.
Tier S · Social science (experimental); metascience · 2018
Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015
Colin F. Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho et al. · Nature Human Behaviour · 2018
It re-tests 21 high-profile social-science experiments via pre-registered, high-powered replications (sample sizes ~5x the originals) and finds a significant same-direction effect for only 13 (62%), with replication effect sizes averaging about 50% of the originals, indicating that both false positives and inflated true-positive effect sizes contribute to imperfect reproducibility. It further reports that peer beliefs predicted which results would replicate, implying failures were not due to chance alone.
Critiques: 21 systematically selected social-science experiments published in Nature and Science (2010–2015)
reproducibility · statistics · methods · claims
Crossref-verified: DOI resolves to this exact title in Nature Human Behaviour (2018); abstract captured from the publisher landing page (nature.com). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre (large-scale pre-registered replication, the Social Sciences Replication Project) confirmed.
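The Crossref-verified notes in this corpus record that a DOI resolves to a given title, venue, and year. A minimal Python sketch of such a check follows, assuming the public Crossref REST endpoint `https://api.crossref.org/works/{doi}` and its `title`, `container-title`, `issued`, and `type` fields. The helper names are hypothetical, and this is an illustration only, not a description of the pipeline the corpus actually uses.

```python
import json
import re
import urllib.parse
import urllib.request


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation and whitespace so cosmetic differences do not fail a match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fetch_crossref_message(doi: str) -> dict:
    """Fetch the Crossref work record for a DOI (a network call, assumed endpoint as described above)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]


def matches_record(message: dict, title: str, venue: str, year: int) -> bool:
    """Return True if the record's title, venue, issued year, and type agree with the listing."""
    titles = [normalise(t) for t in message.get("title", [])]
    venues = [normalise(v) for v in message.get("container-title", [])]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    issued_year = date_parts[0][0] if date_parts and date_parts[0] else None
    return (
        normalise(title) in titles
        and normalise(venue) in venues
        and issued_year == year
        and message.get("type") == "journal-article"
    )
```

A caller would pass the result of `fetch_crossref_message(doi)` to `matches_record` along with the listed title, venue, and year. Crossref's `issued` year can differ from a print-issue year, so a looser year rule may be appropriate in practice.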

Reanalysis · 17

A re-run of the target's own data under different — often more defensible — modelling choices.

Tier A · Criminology / algorithmic risk assessment · 2018 · AI-related target
The accuracy, fairness, and limits of predicting recidivism
Julia Dressel, Hany Farid · Science Advances · 2018
A widely cited reanalysis showing that the commercial COMPAS recidivism risk algorithm (137 features) is no more accurate or fair than predictions from untrained humans recruited on Mechanical Turk (roughly 62% accuracy for the human participants vs 65% for COMPAS), and that a simple two-feature linear classifier matches COMPAS's accuracy. It directly challenges claims that proprietary ML risk-assessment tools provide superior, sophisticated predictive power over simple baselines.
Critiques: Evaluating the predictive validity of the COMPAS Risk and Needs Assessment System (Northpointe/COMPAS recidivism risk tool)
statistics · methods · claims · overclaiming · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Science Advances (2018), indexed as journal-article.
Tier A · Political science / computational social science (ML methodology) · 2023 · AI-related target
Leakage and the reproducibility crisis in machine-learning-based science
Sayash Kapoor, Arvind Narayanan · Patterns · 2023
A reproducibility audit identifying data leakage as a pervasive failure mode across 294 ML-based-science papers in 17 fields; its central social-science case study reproduces civil-war prediction papers and shows that, after correcting for leakage, complex ML models do not outperform decades-old logistic regression, overturning published claims of ML superiority. It challenges overclaimed ML performance and proposes model info sheets as a remedy.
Critiques: ML-based civil-war / armed-conflict prediction studies claiming complex ML outperforms logistic regression (e.g., Colaresi & Mahmood and the systematic review of conflict-forecasting papers)
reproducibility · data_code · methods · statistics · overclaiming · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Patterns (2023), indexed as journal-article.
Tier S · Political Science (experimental methods / voter mobilization) · 2005
Do Get-Out-the-Vote Calls Reduce Turnout? The Importance of Statistical Methods for Field Experiments
Kosuke Imai · American Political Science Review · 2005
Reanalyzes Gerber and Green's influential New Haven GOTV field experiment and argues that the implemented treatment and control groups were not balanced as a randomized design requires. Applying matching and corrected statistical methods, Imai claims that phone calls in fact produced large positive turnout effects, contradicting the original null result and highlighting the consequences of statistical and computational choices in field experiments.
Critiques: The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment (Gerber & Green, 2000)
identification · statistics · methods · data_code · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.
Tier A · Sociology (family/demography) · 2015
Measurement, methods, and divergent patterns: Reassessing the effects of same-sex parents
Simon Cheng, Brian Powell · Social Science Research · 2015
Reanalyzes Regnerus's (2012) New Family Structures Study and shows his negative findings for the adult children of parents who had a same-sex relationship are fragile. At least a third to two-fifths of the 236 same-sex-parent cases are misclassified, and the results further hinge on contested measurement and coding choices (outcome recoding, the comparison category, sociodemographic controls, multiple imputation). Correcting the misclassification and these choices renders most of the associations statistically insignificant.
Critiques: How different are the adult children of parents who have same-sex relationships? Findings from the New Family Structures Study
data_code · methods · statistics · reproducibility · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Social Science Research (2015), indexed as journal-article.
Tier S · Sociology (neighborhood effects / urban poverty) · 2008
Neighborhood Effects on Economic Self-Sufficiency: A Reconsideration of the Moving to Opportunity Experiment
Susan Clampet-Lundquist, Douglas S. Massey · American Journal of Sociology · 2008
Reconsiders the influential Moving to Opportunity housing-voucher experiment's conclusion of null neighborhood effects on adult economic self-sufficiency, arguing that the intention-to-treat design and treatment definition mask real effects; using duration and quality of neighborhood exposure, the authors find evidence that sustained exposure to lower-poverty neighborhoods does improve economic outcomes.
Critiques: Moving to Opportunity / experimental analyses concluding null neighborhood effects on economic self-sufficiency (Kling, Liebman, and Katz; MTO interim impacts evaluation)
identification · methods · claims · statistics · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Sociology (2008), indexed as journal-article.
Tier A · Political science (political methodology / causal inference) · 2011
Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008
Devin Caughey, Jasjeet S. Sekhon · Political Analysis · 2011
Replicating close U.S. House races, it shows bare winners and bare losers differ markedly on pretreatment covariates (financial, experience, and incumbency advantages), undermining the as-if-random assumption underpinning Lee-style regression discontinuity designs for elections. It attributes the imbalance to sorting via activities on or before Election Day rather than post-election manipulation.
Critiques: Randomized Experiments from Non-random Selection in U.S. House Elections
identification · methods · statistics · reproducibility · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Political Analysis (2011), indexed as journal-article.
Tier S · Political science (voting behavior / retrospective voting) · 2018
Do Shark Attacks Influence Presidential Elections? Reassessing a Prominent Finding on Voter Competence
Anthony Fowler, Andrew B. Hall · The Journal of Politics · 2018
Reanalyzing Achen and Bartels's claim that 1916 New Jersey shark attacks cost Woodrow Wilson roughly ten points in beach communities, it finds the county-level effect shrinks and weakens under alternative specifications and the town-level Ocean County result largely vanishes once coding errors are corrected. It concludes there is little compelling evidence that shark attacks influenced the election, casting doubt on this prominent 'blind retrospection' demonstration of voter incompetence.
Critiques: Blind Retrospection: Electoral Responses to Drought, Flu, and Shark Attacks (in Democracy for Realists)
statistics · data_code · claims · overclaiming · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in The Journal of Politics (2018), indexed as journal-article.
Tier S · Finance (asset pricing / mutual fund performance) · 2019
Reassessing False Discoveries in Mutual Fund Performance: Skill, Luck, or Lack of Power?
Angie Andrikogiannopoulou, Filippos Papakonstantinou · The Journal of Finance · 2019
Reanalyzes the false-discovery-rate (FDR) method that Barras, Scaillet, and Wermers (2010) use to separate skilled, zero-alpha, and unskilled mutual funds. Andrikogiannopoulou and Papakonstantinou show via simulation that the FDR estimator is severely biased and underpowered at empirically relevant sample sizes, drastically overstating the fraction of zero-alpha funds and understating the proportion of skilled and unskilled funds.
Critiques: False Discoveries in Mutual Fund Performance: Measuring Luck in Estimated Alphas — Laurent Barras, Olivier Scaillet, Russ Wermers
statistics · methods · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in The Journal of Finance (2019), indexed as journal-article.
Tier S · Machine learning fairness / health policy (algorithmic bias in healthcare) · 2019 · AI-related target
Dissecting racial bias in an algorithm used to manage the health of populations
Ziad Obermeyer, Brian Powers, Christine Vogeli, Sendhil Mullainathan · Science · 2019
The piece challenges a widely deployed commercial population-health algorithm that uses health-care cost as a proxy for health need, showing it exhibits significant racial bias: at any given risk score Black patients are sicker (more uncontrolled illness) than White patients, because unequal access means less money is spent on Black patients despite equal need. It concludes that remedying this would raise the share of Black patients flagged for extra help from 17.7% to 46.5%, and generalizes that choosing convenient proxies for ground truth (here, cost for illness) is an important and underappreciated source of algorithmic bias.
Critiques: A widely deployed commercial population-health risk-prediction algorithm that uses health-care cost as a proxy for health need
claims · data_code · generalisation · statistics
Crossref-verified: DOI resolves in Crossref to "Dissecting racial bias in an algorithm used to manage the health of populations" in Science (2019), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).
Tier A · AI and Law · 2024 · AI-related target
Re-evaluating GPT-4’s bar exam performance
Eric Martínez · Artificial Intelligence and Law · 2024
It challenges OpenAI's headline claim that GPT-4 scored at the 90th percentile on the Uniform Bar Exam, arguing the figure is overinflated because it relies on a February comparison group skewed toward repeat takers. The paper estimates that GPT-4's percentile drops to roughly the 62nd against first-time takers and roughly the 48th (about the 15th on essays) against those who actually passed. It also questions the validity of the reported essay and scaled (298) score, and finds that few-shot chain-of-thought prompting, but not temperature, significantly affects MBE performance.
Critiques: GPT-4 passes the bar exam — Daniel Martin Katz, Michael James Bommarito, Shang Gao et al. (2024)
methods · statistics · claims · reproducibility · overclaiming · generalisation
Crossref-verified: DOI resolves in Crossref to "Re-evaluating GPT-4’s bar exam performance" in Artificial Intelligence and Law (2024), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).
Tier A · Medical imaging / machine learning (AI in radiology) · 2021 · AI-related target
AI for radiographic COVID-19 detection selects shortcuts over signal
Alex J. DeGrave, Joseph D. Janizek, Su-In Lee · Nature Machine Intelligence · 2021
Using explainable-AI techniques, the authors re-examine published deep-learning systems claiming accurate COVID-19 detection from chest radiographs and find they rely on confounding "shortcuts" rather than medical pathology, so they appear accurate but fail in new hospitals. They further argue that external-data evaluation is insufficient to detect this, since the spurious shortcuts may not degrade performance even in new hospitals.
Critiques: Recently reported deep-learning AI systems that claim to accurately detect COVID-19 from chest radiographs (and, by extension, related CT and other medical-imaging systems trained via the same data-collection approach).
methods · identification · generalisation · reproducibility · overclaiming · claims
Crossref-verified: DOI resolves in Crossref to "AI for radiographic COVID-19 detection selects shortcuts over signal" in Nature Machine Intelligence (2021). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier B · Machine learning fairness / criminal justice risk assessment · 2017 · AI-related target
Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments
Alexandra Chouldechova · Big Data · 2017
It challenges the assumption that an RPI can simultaneously satisfy all of the several recently-proposed fairness criteria, demonstrating that these criteria are mutually incompatible whenever recidivism prevalence differs across groups, and shows that disparate impact can arise when an instrument fails to achieve error-rate balance.
Critiques: Recidivism prediction instruments (RPIs) and the recently-applied fairness criteria used to assess them; implicitly the dispute over whether such instruments (e.g., the kind at the center of the controversy referenced) exhibit discriminatory bias.
statistics · claims · methods · theory
Crossref-verified: DOI resolves in Crossref to "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments" in Big Data (2017). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
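The incompatibility summarized in this entry can be illustrated with the confusion-matrix identity FPR = p/(1 − p) · (1 − PPV)/PPV · (1 − FNR), where p is the group's recidivism prevalence. The Python sketch below is a hedged illustration of that identity, not the paper's code: the function name and the two groups' numbers are hypothetical. It shows that if two groups share the same positive predictive value and false negative rate but differ in prevalence, their false positive rates must differ.

```python
def implied_fpr(prevalence: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by prevalence, PPV, and FNR through the confusion-matrix identity."""
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups: same PPV (0.6) and same FNR (0.3), different prevalence.
group_a = implied_fpr(prevalence=0.5, ppv=0.6, fnr=0.3)
group_b = implied_fpr(prevalence=0.3, ppv=0.6, fnr=0.3)

# The higher-prevalence group ends up with a higher false positive rate
# (about 0.47 versus 0.20 with these illustrative numbers).
print(round(group_a, 3), round(group_b, 3))
```

Holding predictive parity and the false negative rate fixed, the only free quantity left is the false positive rate, so error-rate balance fails whenever prevalence differs.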
Tier A · Medical imaging / clinical machine learning (radiology AI) · 2019 · AI-related target
Deep learning predicts hip fracture using confounding patient and healthcare variables
Marcus A. Badgeley, John R. Zech, Luke Oakden-Rayner, et al. · npj Digital Medicine · 2019
It challenges the validity of deep-learning CAD models for hip-fracture detection by showing that the model also predicts patient traits and 14 hospital process variables (e.g., scanner model AUC=1.00, "priority" order AUC=0.79) from the same radiographs, and that fracture-prediction performance collapses to random (AUC=0.52) once fracture risk is balanced across these patient and process variables — indicating the model's apparent accuracy is largely driven by confounding shortcuts rather than genuine fracture features.
Critiques: Computer-aided diagnosis / deep-learning models that claim to detect hip fractures from pelvic radiographs, by re-examining what image features such models actually leverage.
identification · methods · claims · overclaiming · generalisation
Crossref-verified: DOI resolves in Crossref to "Deep learning predicts hip fracture using confounding patient and healthcare variables" in npj Digital Medicine (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier A · Development economics / empirical macroeconomics · 2012
Counting Chickens when they Hatch: Timing and the Effects of Aid on Growth
Michael A. Clemens, Steven Radelet, Rikhil R. Bhavnani, Samuel Bazzi · The Economic Journal · 2012
It challenges the divergent cross-country aid-growth estimates of three influential prior studies by re-running their exact regression specifications while adding realistic lag assumptions about aid's timing and dropping invalid or weak instruments. With these changes all three designs converge on the finding that increases in aid are followed by modest increases in investment and growth, implying aid causes some modest growth that varies across recipients and diminishes at high aid levels.
Critiques: The three most influential published cross-country aid-growth studies (referenced collectively; the proposed target names the Aid, Policies, and Growth / Burnside-Dollar-style aid-growth literature), whose regression designs the authors re-estimate.
identification · methods · statistics · claims
Crossref-verified: DOI resolves in Crossref to "Counting Chickens when they Hatch: Timing and the Effects of Aid on Growth" in The Economic Journal (2012). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier A · Medical imaging / machine learning in radiology (computer-aided diagnosis) · 2018 · AI-related target
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa et al. · PLOS Medicine · 2018
It challenges the assumption that pneumonia-detection CNNs generalize across hospitals, showing that models performed better internally than externally in 3 of 5 comparisons and that differing pneumonia prevalence between sites let a model reach AUC 0.861 merely by sorting radiographs by hospital. It further shows CNNs can identify the source hospital from a radiograph with roughly 99.95% to 99.98% accuracy, implying reported accuracy may reflect site-specific confounding rather than true pathology detection.
Critiques: A class/body of prior deep-learning work claiming high diagnostic accuracy for CNN-based pneumonia detection on chest radiographs, and the broader assumption that such image-classification CNNs generalize well to new data. The abstract references "recent work" and prior optimism but does not name a specific paper or system (e.g., it does not mention CheXNet by name).
generalisation · methods · identification · overclaiming · claims · statistics
Crossref-verified: DOI resolves in Crossref to "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study" in PLOS Medicine (2018). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.
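The point in this entry that a model can score well above chance by sorting on hospital alone follows from simple arithmetic. The Python sketch below computes the AUC of a "classifier" whose only output is the site label, given each site's size and disease prevalence. The function name and the example numbers are hypothetical and are not taken from the study; they only illustrate the mechanism.

```python
def auc_from_site_only(n_a: int, prev_a: float, n_b: int, prev_b: float) -> float:
    """AUC of a scorer that ranks every site-A image above every site-B image.

    AUC is the probability that a random positive outranks a random negative,
    counting ties (same site) as one half.
    """
    pos_a, pos_b = n_a * prev_a, n_b * prev_b
    neg_a, neg_b = n_a - pos_a, n_b - pos_b
    wins = pos_a * neg_b                   # positive from A, negative from B
    ties = pos_a * neg_a + pos_b * neg_b   # positive and negative from the same site
    return (wins + 0.5 * ties) / ((pos_a + pos_b) * (neg_a + neg_b))


# Hypothetical sites: 30% prevalence at site A, 2% at site B.
print(round(auc_from_site_only(1000, 0.30, 1000, 0.02), 3))
# With equal prevalence the site label carries no signal and the AUC is 0.5.
print(auc_from_site_only(1000, 0.10, 1000, 0.10))
```

Because the site label is recoverable from the image almost perfectly, a network can pick up this prevalence signal without learning anything about pathology.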
Tier S · Economics (health economics / applied econometrics) · 2011
Saving Babies? Revisiting the Effect of Very Low Birth Weight Classification
Alan I. Barreca, Melanie Guldi, Jason M. Lindo, Glen R. Waddell · The Quarterly Journal of Economics · 2011
The paper challenges the regression-discontinuity (RD) estimate of Almond, Doyle, Kowalski, and Williams (ADKW, 2010) that crossing the 1,500-g very-low-birth-weight (VLBW) threshold reduces 1-year infant mortality by about one percentage point. Because the running variable exhibits extensive heaping at 1-oz and 100-g multiples, the point estimate is highly sensitive to dropping observations near the threshold. It concludes this sensitivity weakens confidence in the original, policy-relevant result.
Critiques: Almond, Doyle, Kowalski & Williams (2010), "Estimating Marginal Returns to Medical Care: Evidence from At-Risk Newborns" (QJE) — its RD finding that 1-year infant mortality drops ~1pp as birth weight crosses the 1,500-g VLBW threshold
identification · methods · statistics · reproducibility · claims · overclaiming
Crossref-verified: DOI 10.1093/qje/qjr042 confirmed by the downloaded full-text PDF (QJE 126(4):2117–2123, 2011). whatItChallenges grounded in the article's abstract+intro (full text) via the faithfulness ingestion gate (2026-06-22); gate refined the proposed type comment → reanalysis (it re-estimates the original RD design). Source: user-supplied PDF download.
Tier B · Economics (economics of education / labor economics) · 1999
Further estimates of the economic return to schooling from a new sample of twins
Cecilia Elena Rouse · Economics of Education Review · 1999
Re-examining Ashenfelter & Krueger's twins estimates using three additional years of the same survey, the paper finds the within-twin return estimate is smaller than the cross-sectional one (implying a small upward bias in the cross-section), reversing Ashenfelter & Krueger's reported pattern — though their measurement-error-corrected estimates are statistically indistinguishable from these. It also finds evidence of an important individual-specific component to measurement error in schooling reports.
Critiques: Ashenfelter & Krueger (1994), "Estimates of the Economic Return to Schooling from a New Sample of Twins" (AER)
statistics · methods · identification · data_code
Crossref-verified: DOI 10.1016/S0272-7757(98)00038-7 confirmed by the downloaded full-text PDF (Economics of Education Review 18:149–157, 1999; PII S0272-7757(98)00038-7). whatItChallenges grounded in the article's abstract (full text) via the faithfulness ingestion gate (2026-06-22); genuine reanalysis of Ashenfelter & Krueger (1994). Source: user-supplied PDF download.

Critical commentary · 13

A critical commentary that challenges the target's framing, inference, or generalisation without a formal Comment slot.

Tier S · Sociology / computational social science · 2020 · AI-related target
What failure to predict life outcomes can teach us
Filiz Garip · Proceedings of the National Academy of Sciences · 2020
An invited PNAS commentary on Salganik et al.'s Fragile Families Challenge, arguing that the mass-collaboration finding that machine-learning models barely beat a simple benchmark exposes real limits of predictive ML in social science, and that the value lies in the common-task framework and out-of-sample testing rather than in any individual model's accuracy. It reframes the celebrated ML exercise as evidence of how little predictive purchase rich data plus ML actually buys for individual life outcomes.
Critiques: Measuring the predictability of life outcomes with a scientific mass collaboration
methods · claims · overclaiming · generalisation · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in Proceedings of the National Academy of Sciences (2020), indexed as journal-article.
Tier S · Sociology (organizational ecology) · 1991
Density Dependence in Organizational Mortality: Legitimacy or Unobserved Heterogeneity?
Trond Petersen, Kenneth W. Koput · American Sociological Review · 1991
Challenges the standard interpretation of density-dependence tests in organizational ecology, arguing that the observed negative first-order effect of organizational density on mortality rates is as consistent with unobserved heterogeneity (selection) as with the theorized legitimation process, which undermines the causal-theoretical reading of Hannan and Carroll's models.
Critiques: Density Dependence in the Evolution of Populations of Newspaper Organizations — Glenn R. Carroll, Michael T. Hannan
statistics · identification · methods · theory · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Sociological Review (1991), indexed as journal-article.
Tier S · Political science (political methodology / survey experiments) · 2022
What Do We Learn about Voter Preferences from Conjoint Experiments?
Scott F. Abramson, Korhan Kocak, Asya Magazinnik · American Journal of Political Science · 2022
It shows that the average marginal component effect (AMCE), the central estimand of conjoint experiments popularized by Hainmueller, Hopkins, and Yamamoto, is not well defined in terms of majority preferences: even with rational subjects a positive AMCE can point opposite to the true majority preference, so AMCEs do not license common claims about what voters prefer. It argues the estimand conflates direction and intensity of preferences across respondents.
Critiques: Causal Inference in Conjoint Analysis: Understanding Multidimensional Choices via Stated Preference Experiments
theory · methods · claims · overclaiming · statistics
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Political Science (2022), indexed as journal-article.
Tier A · Education / Educational Psychology · 2018
What Shall We Do About Grit? A Critical Review of What We Know and What We Don't Know
Marcus Credé · Educational Researcher · 2018
Critically reviews the grit literature popularized by Angela Duckworth, arguing the empirical evidence does not justify combining passion and perseverance into a single construct, that grit predicts academic performance only weakly (and no better than conscientiousness, a jangle-fallacy concern), and that there is no evidence grit interventions work.
Critiques: Grit: Perseverance and Passion for Long-Term Goals (Duckworth, Peterson, Matthews, & Kelly)
statistics · claims · overclaiming · theory · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2018), indexed as journal-article.
Tier S · Criminology · 2010
Poverty, Infant Mortality, and Homicide Rates in Cross-National Perspective: Assessments of Criterion and Construct Validity
Steven F. Messner, Lawrence E. Raffalovich, Gretchen M. Sutton · Criminology · 2010
Responds to Pridemore's critique of using infant mortality as a proxy for poverty in cross-national homicide research. Rather than re-running his analysis, the authors assemble a new 16-nation panel (1993-2000) with direct income-based poverty measures and find infant mortality correlates more strongly with relative than absolute poverty, arguing disadvantage is best treated as a multidimensional construct — a qualified, collaborative response rather than a wholesale rejection of the proxy.
Critiques: A Methodological Addition to the Cross-National Empirical Literature on Social Structure and Homicide: A First Test of the Poverty-Homicide Thesis — William Alex Pridemore
statistics · methods · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Criminology (2010), indexed as journal-article.
Tier A · Natural language processing / AI ethics · 2023 · AI-related target
GPT detectors are biased against non-native English writers
Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu et al. · Patterns · 2023
Challenges the reliability and fairness of GPT/AI-text detectors by showing they frequently misclassify non-native English writing as AI-generated; concludes this bias threatens to marginalize non-native English speakers in evaluative and educational settings and must be addressed for an equitable digital landscape.
Critiques: Commercial and academic GPT/AI-text detectors claiming reliable detection of machine-generated text
claims · generalisation · overclaiming · methods
Crossref-verified: DOI resolves in Crossref to "GPT detectors are biased against non-native English writers" in Patterns (2023), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22). Note: abstract is brief; summary kept to what the abstract licenses.
Tier B · Criminology / data science (algorithmic fairness in policing) · 2016 · AI-related target
To Predict and Serve?
Kristian Lum, William Isaac · Significance · 2016
The piece challenges the premise that place-based predictive-policing systems deliver objective, bias-free crime prediction, arguing that because these systems are trained on biased data, their outputs and resulting deployment can reproduce that bias with adverse social consequences. The abstract frames this as an examination of the evidence and social costs rather than reporting a specific empirical result.
Critiques: Place-based predictive-policing systems (PredPol-style) claiming objective, bias-free crime prediction
data_code · claims · overclaiming · methods
Crossref-verified: DOI resolves in Crossref to "To Predict and Serve?" in Significance (2016), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22). Note: abstract is brief; summary kept to what the abstract licenses.
Tier A · Medical artificial intelligence / diagnostic imaging (evidence synthesis) · 2019 · AI-related target
A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis
Xiaoxuan Liu, Livia Faes, Aditya U. Kale, Siegfried K. Wagner et al. · The Lancet Digital Health · 2019
Through a systematic review and meta-analysis of 82 studies, it finds that while deep learning diagnostic performance appears equivalent to health-care professionals, very few studies used external validation or compared algorithms and clinicians on the same sample, and poor reporting is pervasive — undermining confidence in the field's accuracy claims. It thus challenges the reliability and generalisability of the existing deep-learning-versus-clinician comparison literature rather than affirming it at face value.
Critiques: The body of deep learning diagnostic-imaging studies (2012–2019) that report deep learning algorithm performance in disease classification relative to health-care professionals, particularly those claiming equivalent or superior diagnostic accuracy.
methods · reproducibility · generalisation · overclaiming · claims · statistics
Crossref-verified: DOI resolves in Crossref to "A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis" in The Lancet Digital Health (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier A · Medical machine learning / radiology AI (COVID-19 imaging) · 2021 · AI-related target
Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans
Michael Roberts, Derek Driggs, Matthew Thorpe, et al. (AIX-COVNET) · Nature Machine Intelligence · 2021
Through a systematic review of all CXR/CT machine-learning COVID-19 models published between 1 Jan and 3 Oct 2020, it finds that none of the 62 included models is of potential clinical use due to methodological flaws and/or underlying biases, and it issues recommendations to remedy these problems.
Critiques: The body of 2020 machine-learning models published as papers/preprints (62 studies, screened from 2,212) claiming to diagnose or prognosticate COVID-19 from chest X-ray (CXR) and CT images.
methods · data_code · reproducibility · overclaiming · generalisation · claims
Crossref-verified: DOI resolves in Crossref to "Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans" in Nature Machine Intelligence (2021). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier B · Communication / political communication / media studies · 2018 · AI-related target
The echo chamber is overstated: the moderating effect of political interest and diverse media
Elizabeth Dubois, Grant Blank · Information, Communication & Society · 2018
Using a nationally representative survey of UK adult internet users (N=2000) and five echo-chamber measures, the paper challenges the prevailing echo-chamber thesis by showing that politically interested people and those with diverse media diets tend to avoid echo chambers, so only a small population segment is actually caught in one. It further argues that single-medium studies and those using narrow definitions/measurements are flawed because they fail to test the theory in a realistic multi-media environment.
Critiques: The echo-chamber / filter-bubble thesis that, in a high-choice media environment, individuals (especially the politically interested) select self-reinforcing content and become segregated into homogeneous, partisan information environments — including prior single-medium studies that operationalize "being in an echo chamber" through narrow definitions and measurements.
claims · methods · overclaiming · theory · generalisation
Crossref-verified: DOI resolves in Crossref to "The echo chamber is overstated: the moderating effect of political interest and diverse media" in Information, Communication & Society (2018). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).
Tier S · Data science / computational epidemiology · 2014 · AI-related target
The Parable of Google Flu: Traps in Big Data Analysis
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani · Science · 2014
The piece challenges the celebrated big-data system Google Flu Trends, noting that despite being built to predict CDC influenza-like-illness reports, in February 2013 it predicted more than double the proportion of doctor visits that the CDC reported. It argues these large, largely avoidable prediction errors hold broader lessons about the pitfalls of big data analysis.
Critiques: Google Flu Trends (GFT), the search-query-based influenza-tracking system built to predict CDC influenza-like-illness estimates, widely cited as an exemplary use of big data.
methods · claims · overclaiming · reproducibility
Crossref-verified: DOI resolves in Crossref to "The Parable of Google Flu: Traps in Big Data Analysis" in Science (2014). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.
Tier A · Machine learning / AI ethics and interpretability · 2019 · AI-related target
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
Cynthia Rudin · Nature Machine Intelligence · 2019
It argues that attempting to explain black-box models post hoc, rather than building inherently interpretable models, perpetuates bad practice and risks great societal harm in high-stakes settings. It contends that for applications directly affecting human lives (healthcare, criminal justice), effort should go toward inherently interpretable models, which it claims could often replace black boxes.
Critiques: The practice of using "explainable AI" methods to post-hoc explain black-box machine learning models deployed for high-stakes decisions, named at the domain level (criminal justice, healthcare, computer vision); the abstract does not name COMPAS specifically.
methods · claims · theory · overclaiming
Crossref-verified: DOI resolves in Crossref to "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead" in Nature Machine Intelligence (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.
Tier S · Medicine (clinical algorithms) · 2020 · AI-related target
Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms
Darshali A. Vyas, Leo G. Eisenstein, David S. Jones · New England Journal of Medicine · 2020
It challenges the embedded practice of race correction in clinical diagnostic algorithms and guidelines, arguing that adjusting outputs on the basis of race or ethnicity may steer more clinical attention or resources toward white patients than toward racial and ethnic minority patients.
Critiques: Diagnostic algorithms and clinical practice guidelines that apply race correction to their outputs
methods · claims · generalisation
Crossref-verified: DOI resolves to this exact title in N Engl J Med 383:874–882 (2020); abstract captured from the publisher landing page (nejm.org). whatItChallenges grounded in the (brief, single-sentence) abstract via the faithfulness ingestion gate (2026-06-22); genre (critical commentary on a class of race-corrected clinical algorithms) confirmed. Abstract characterizes a category of tools collectively rather than re-analyzing one named study.

How these calibrate Critical AI

Each benchmark is tagged with the critique dimensions it exercises (identification, statistics, reproducibility, overclaiming, generalisation, and so on). Those are the same dimensions Critical AI’s own critique pipeline works through. A published Comment that overturns a headline result by re-running its regressions is the concrete standard our critiques are measured against: specific, sourced, and falsifiable, never a verdict on motive. The corpus skews toward the most-cited exemplars of the genre so the bar is set high.
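
As an illustration of how the dimension tags can be used, the sketch below tallies how many entries exercise each dimension. The entry structure and the field names `title` and `dimensions` are assumptions made for this example, not the documented shape of the machine-readable list; the tags themselves are copied from three of the listings above.

```python
from collections import Counter

# Hypothetical in-memory form of three entries. Field names are assumed;
# the tags are copied from the corresponding listings on this page.
BENCHMARKS = [
    {"title": "To Predict and Serve?",
     "dimensions": ["data_code", "claims", "overclaiming", "methods"]},
    {"title": "The Parable of Google Flu: Traps in Big Data Analysis",
     "dimensions": ["methods", "claims", "overclaiming", "reproducibility"]},
    {"title": "GPT detectors are biased against non-native English writers",
     "dimensions": ["claims", "generalisation", "overclaiming", "methods"]},
]


def dimension_coverage(entries):
    """Count how many entries exercise each critique dimension."""
    counts = Counter()
    for entry in entries:
        # set() guards against a tag being listed twice on one entry.
        counts.update(set(entry["dimensions"]))
    return counts


coverage = dimension_coverage(BENCHMARKS)
for dimension, n in coverage.most_common():
    print(f"{dimension}: {n}")
```

A tally like this makes it easy to see which dimensions a subset of the corpus leaves thin, which matters if the subset is used to calibrate a critique pipeline.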

Every DOI here resolved through Crossref at ingestion. The benchmark set grows as more verified exemplars are added; the machine-readable list is at /critique/api/benchmarks.
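
A minimal sketch of what a title-and-venue check against a Crossref record might look like follows. It is not Critical AI's actual ingestion code. It assumes the record has the shape of the `message` object returned by the Crossref REST API for a single work (`title` and `container-title` as lists, `issued.date-parts` as nested lists, `type` as a string). The function name `matches_record` and the sample record are hypothetical and hand-written for illustration, not a captured API response.

```python
def normalise(text):
    """Lowercase and collapse whitespace so cosmetic differences do not cause a mismatch."""
    return " ".join(text.lower().split())


def matches_record(message, title, venue, year):
    """Return True if a Crossref-style record agrees with the listed title, venue, and year.

    Also requires the record to be indexed as a journal article, which is an
    assumption of this sketch rather than a stated rule of the corpus.
    """
    titles = [normalise(t) for t in message.get("title", [])]
    venues = [normalise(v) for v in message.get("container-title", [])]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    issued_year = date_parts[0][0] if date_parts and date_parts[0] else None
    return (
        normalise(title) in titles
        and normalise(venue) in venues
        and issued_year == year
        and message.get("type") == "journal-article"
    )


# Hand-written record in the assumed shape (illustrative only).
sample = {
    "title": ["To Predict and Serve?"],
    "container-title": ["Significance"],
    "issued": {"date-parts": [[2016]]},
    "type": "journal-article",
}

print(matches_record(sample, "To Predict and Serve?", "Significance", 2016))
```

Requiring an exact normalised title match, rather than a fuzzy one, is a deliberately strict choice: a near-miss is more likely to be a different paper than a typo.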
