Benchmarks · the calibration corpus
Published critiques, as the standard to meet
Critique is not a novelty Critical AI invented. The strongest journals in every social science publish formal Comments, author Replies, replication studies and reanalyses — adversarial scholarship that contests a specific claim in print. This page collects real exemplars of that genre as a calibration corpus: the bar an AI-native critique should clear. Each one’s DOI is independently Crossref-verified, because a benchmark built on a fabricated citation would be worse than none.
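As a rough illustration of what "Crossref-verified" could involve, the sketch below compares a Crossref-style metadata record against the title, journal and year shown for an entry. The field names (title, container-title, issued, type) follow the public Crossref REST API works schema. The function names, the normalisation rule and the sample record are assumptions made for illustration, not this page's actual pipeline. No DOI is hard-coded and no network call is made.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Case-fold and collapse whitespace so trivial formatting differences are ignored."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def record_matches(message: dict, title: str, journal: str, year: int) -> bool:
    """Return True if a Crossref 'message' object agrees with the listed entry."""
    titles = message.get("title") or []
    journals = message.get("container-title") or []
    date_parts = (message.get("issued") or {}).get("date-parts") or []
    record_year = date_parts[0][0] if date_parts and date_parts[0] else None
    return (
        any(normalise(t) == normalise(title) for t in titles)
        and any(normalise(j) == normalise(journal) for j in journals)
        and record_year == year
        and message.get("type") == "journal-article"
    )


# Hypothetical record shaped like the "message" part of a Crossref works response.
sample = {
    "title": ["The Impact of Legalized Abortion on Crime: Comment"],
    "container-title": ["Quarterly Journal of Economics"],
    "issued": {"date-parts": [[2008]]},
    "type": "journal-article",
}

print(record_matches(
    sample,
    "The Impact of Legalized Abortion on Crime: Comment",
    "Quarterly Journal of Economics",
    2008,
))
```

Requiring an exact (normalised) title match is deliberately strict: a near miss is treated as a failure rather than silently accepted.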
Comment · 10
A formal Comment published in the same journal, contesting a specific claim, method, or result of the target paper.
- Tier S · Applied Microeconomics / Crime · 2008
The Impact of Legalized Abortion on Crime: Comment
Christopher L. Foote, Christopher F. Goetz · Quarterly Journal of Economics · 2008
The comment identifies a coding mistake in the within-state cohort regressions of Donohue and Levitt's abortion-crime paper and shows that correcting it and using a per-capita crime specification sharply weakens the results. It also shows the cross-state tests are not robust to allowing differential state trends.
Critiques: The Impact of Legalized Abortion on Crime — John J. Donohue III, Steven D. Levitt
methods · identification · data_code · statistics · claims · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in Quarterly Journal of Economics (2008), indexed as journal-article.
- Tier S · Development Economics / Institutions · 2012
The Colonial Origins of Comparative Development: An Empirical Investigation: Comment
David Y. Albouy · American Economic Review · 2012
Albouy shows that 36 of the 64 countries are assigned settler-mortality rates borrowed from other countries and that incomparable rates from laborers, bishops, and soldiers on campaign are combined in ways favorable to the institutions hypothesis. Once data problems are addressed, the mortality-expropriation relationship and the instrumental-variable estimates lose robustness, often yielding effectively infinite confidence intervals.
Critiques: The Colonial Origins of Comparative Development: An Empirical Investigation — Daron Acemoglu, Simon Johnson, James A. Robinson
data_code · identification · methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2012), indexed as journal-article.
- Tier S · Psychology / metascience · 2016
Comment on "Estimating the reproducibility of psychological science"
Daniel T. Gilbert, Gary King, Stephen Pettigrew, Timothy D. Wilson · Science · 2016
Argues the Open Science Collaboration's Reproducibility Project contains three statistical errors (low-power replications, non-representative study sampling, and misleading endorsement criteria) that bias the reported replication rate downward. Concludes the data are actually consistent with very high reproducibility, not the low rate the original claimed.
Critiques: Estimating the reproducibility of psychological science — Open Science Collaboration
statistics · methods · identification · claims · reproducibility · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Science (2016), indexed as journal-article.
- Tier S · Strategic Management / Organization · 2016
Reaffirming the CEO Effect Is Significant and Much Larger than Chance: A Comment on Fitza (2014)
Timothy J. Quigley, Scott D. Graffin · Strategic Management Journal · 2016
Challenges Fitza's (2014) claim that the estimated 'CEO effect' on firm performance is almost entirely an artifact of random chance, arguing his simulation/variance-decomposition approach mis-specifies the chance baseline. Using corrected methods, they conclude the CEO effect is statistically significant and substantively much larger than chance.
Critiques: The use of variance decomposition in the investigation of CEO effects: How large must the CEO effect be to rule out chance? — Markus A. Fitza
methods · identification · statistics · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.
- Tier A · Sociology (immigration/assimilation) · 2011
Commentary: The Kids Are (Mostly) Alright: Second-Generation Assimilation: Comments on Haller, Portes and Lynch
Richard Alba, Philip Kasinitz, Mary C. Waters · Social Forces · 2011
Challenges Haller, Portes and Lynch's pessimistic 'segmented assimilation / downward assimilation' thesis about the immigrant second generation, arguing that their data, model specification and interpretation overstate the prevalence and inevitability of downward mobility, and that the bulk of second-generation outcomes are in fact reasonably positive.
Critiques: Dreams Fulfilled, Dreams Shattered: Determinants of Segmented Assimilation in the Second Generation
claims · theory · methods · generalisation · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Social Forces (2011), indexed as journal-article.
- Tier A · Education / Special Education Policy · 2016
Risks and Consequences of Oversimplifying Educational Inequities: A Response to Morgan et al. (2015)
Russell J. Skiba, Alfredo J. Artiles, Elizabeth B. Kozleski, Daniel J. Losen et al. · Educational Researcher · 2016
Directly challenges Morgan et al.'s widely-cited claim that racial/ethnic minorities are underrepresented (not overrepresented) in special education, arguing the conclusion is in error due to sampling and model-specification choices, the heavy covariate adjustment that conditions away the inequities of interest, and failure to engage the broader complexity of disproportionality.
methods · identification · statistics · claims · overclaiming · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2016), indexed as journal-article.
- Tier S · Economics (education / labor economics) · 2017
Measuring the Impacts of Teachers: Comment
Jesse Rothstein · American Economic Review · 2017
Challenges Chetty, Friedman, and Rockoff's (2014) claim that the teacher-switching quasi-experiment shows student sorting creates negligible bias in teacher value-added (VA) scores. Rothstein shows teacher switching is correlated with changes in prior student preparedness, so the design is invalid; correcting for this reveals moderate VA bias (10-35% of teachers' causal effect variance) and shows long-run results are fragile to control choices.
Critiques: Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates — Raj Chetty, John N. Friedman, Jonah E. Rockoff
identification · methods · statistics · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2017), indexed as journal-article.
- Tier S · Economics (labor economics) · 2000
Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Comment
David Neumark, William Wascher · American Economic Review · 2000
Re-examines Card and Krueger's (1994) finding that a New Jersey minimum-wage increase did not reduce (and may have raised) fast-food employment. Using administrative payroll records rather than the original telephone-survey data, Neumark and Wascher find that employment fell after the minimum-wage rise, contradicting the original positive/zero estimate and attributing the discrepancy to measurement error in the survey data.
Critiques: Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania — David Card, Alan B. Krueger
data_code · identification · methods · claims · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in American Economic Review (2000), indexed as journal-article.
- Tier A · Quantitative criminology / public policy · 2001
Impossible Policy Evaluations and Impossible Conclusions: A Comment on Koper and Roth
Gary Kleck · Journal of Quantitative Criminology · 2001
Argues that Koper and Roth's evaluation of the 1994 federal assault weapons ban could not, by design, detect any plausible effect because assault weapons figure in so few homicides, so the data lack statistical power to support any conclusion. Contends their tentative inference that the ban may have reduced gun homicides is essentially impossible to sustain from the evidence presented.
Critiques: The Impact of the 1994 Federal Assault Weapon Ban on Gun Violence Outcomes: An Assessment of Multiple Outcome Measures and Some Lessons for Policy Evaluation — Christopher S. Koper, Jeffrey A. Roth
statistics · identification · methods · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Journal of Quantitative Criminology (2001), indexed as journal-article.
- Tier S · Criminology · 2003
Long-Term Trends in Crimes of Violence (Comment on Cooney, 2003)
David F. Greenberg · Criminology · 2003
Challenges Cooney's (2003) historical thesis that violence has been 'privatized' (shifting from elite/collective to marginal/individual actors), arguing the long-term empirical evidence on violent-crime trends and the social characteristics of offenders does not support the claimed qualitative transformation. Questions the selectivity and interpretation of the historical and anthropological evidence Cooney marshals.
Critiques: The Privatization of Violence — Mark Cooney
theory · claims · methods · generalisation · data_code
Crossref-verified: DOI resolves in Crossref to this exact title in Criminology (2003), indexed as journal-article.
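Kleck's power argument in the list above can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions, not Kleck's own analysis: homicide counts are treated as Poisson, a one-sided z-test compares one year before with one year after, and every number in the usage example is hypothetical. Real year-to-year variation in homicides is larger than Poisson noise, so this approximation is, if anything, optimistic about power.

```python
from math import sqrt
from statistics import NormalDist


def power_to_detect_drop(baseline_count: float, share_affected: float,
                         reduction: float, alpha: float = 0.05) -> float:
    """Approximate power of a one-sided z-test comparing two Poisson counts.

    baseline_count: expected events per period before the policy (hypothetical input)
    share_affected: fraction of events involving the regulated category
    reduction:      proportional drop the policy causes within that category
    """
    expected_drop = baseline_count * share_affected * reduction
    after_count = baseline_count - expected_drop
    # Variance of the difference of two independent Poisson counts is the sum of their means.
    standard_error = sqrt(baseline_count + after_count)
    z_critical = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(expected_drop / standard_error - z_critical)


# Hypothetical scenario: 10,000 events per year, 2% involve the regulated
# category, and the policy cuts those by a quarter.
print(round(power_to_detect_drop(10_000, 0.02, 0.25), 3))
```

With inputs like these the chance of detecting the effect sits far below the conventional 80% target, which is the sense in which an evaluation can be unable, by design, to support a conclusion in either direction.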
Reply / Rejoinder · 1
The original authors' Reply or Rejoinder — the other half of an adversarial exchange the journal published in full.
- Tier A · Political communication / media exposure measurement · 2013
The Challenge of Measuring Media Exposure: Reply to Dilliplane, Goldman, and Mutz
Markus Prior · Political Communication · 2013
Prior critiques the program-list measure of televised political exposure that Dilliplane, Goldman, and Mutz proposed (and which the ANES adopted), arguing it has low construct validity because it never measures the amount of exposure and shows poor convergent validity by several criteria. He contends the measure conflates recall/recognition with exposure and overstates the predictive payoff of the new instrument.
Critiques: Televised Exposure to Politics: New Measures for a Fragmented Media Environment
methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Political Communication (2013), indexed as journal-article.
Rejoinder · 3
A rejoinder closing out a published exchange.
- Tier S · Strategic Management / Organization · 2016
How Much Do CEOs Really Matter? Reaffirming That the CEO Effect Is Mostly Due to Chance
Markus A. Fitza · Strategic Management Journal · 2016
Rejoinder defending the original conclusion against Quigley and Graffin's comment, arguing that once more realistic assumptions about how chance affects firm performance are imposed, the apparent CEO effect is statistically indistinguishable from chance regardless of the estimation methodology used.
methods · identification · statistics · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.
- Tier S · Political Science (experimental methods / voter mobilization) · 2005
Correction to Gerber and Green (2000), Replication of Disputed Findings, and Reply to Imai (2005)
Alan S. Gerber, Donald P. Green · American Political Science Review · 2005
Responds to Imai's (2005) reanalysis: acknowledges and repairs data-processing errors in the original 2000 article, then argues Imai's correction itself contains statistical, computational, and reporting errors that invalidate its conclusions. After fixes, the original substantive finding stands that brief phone calls do not meaningfully increase voter turnout.
statistics · methods · data_code · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.
- Tier A · Education / Mathematics Education Policy · 2008
Rejoinder to the Critiques of the National Mathematics Advisory Panel Final Report
Camilla Persson Benbow, Larry R. Faulkner · Educational Researcher · 2008
The Panel chair and co-chair rebut a cluster of published critiques (notably Boaler and Kelly) that attacked the Panel's restriction to randomized/quasi-experimental evidence on math curricula, instruction, and learning, defending the evidentiary standards and contesting claims that the report's methodology was inappropriate for educational field research.
methods · identification · claims · generalisation · theory
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2008), indexed as journal-article.
Replication study · 12
An independent attempt to reproduce the target's result with its data and code, or to repeat it in a new sample.
- Tier A · Macroeconomics / Public Finance · 2013
Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff
Thomas Herndon, Michael Ash, Robert Pollin · Cambridge Journal of Economics · 2013
Replicating Reinhart and Rogoff's claim that public debt above 90% of GDP is associated with sharply lower growth, the authors find a spreadsheet coding error, selective exclusion of available country-year data, and unconventional weighting. Corrected, average real growth for high-debt countries is +2.2%, not the published -0.1%, eliminating the supposed debt threshold.
Critiques: Growth in a Time of Debt — Carmen M. Reinhart, Kenneth S. Rogoff
methods · data_code · reproducibility · statistics · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Cambridge Journal of Economics (2013), indexed as journal-article.
- Tier A · Social/cognitive psychology (self-control) · 2016
A Multilab Preregistered Replication of the Ego-Depletion Effect
Martin S. Hagger, Nikos L. D. Chatzisarantis, Hugo Alberts, Calvin Octavianus Anggono · Perspectives on Psychological Science · 2016
A preregistered Registered Replication Report across 23 labs (~2000 participants) tested the sequential-task ego-depletion effect and found a meta-analytic effect indistinguishable from zero (d ≈ 0.04). Challenges the existence and robustness of the widely cited ego-depletion / limited-resource model of self-control.
Critiques: Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers — Chandra Sripada, Daniel Kessler, John Jonides
reproducibility · methods · statistics · claims · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.
- Tier A · Social psychology / emotion · 2016
Registered Replication Report: Strack, Martin, & Stepper (1988)
E.-J. Wagenmakers, Titia Beek, Laura Dijkhoff, Quentin F. Gronau · Perspectives on Psychological Science · 2016
A Registered Replication Report of 17 direct replications of the classic pen-in-mouth facial-feedback study found a pooled effect of 0.03 rating units (95% CI -0.11 to 0.16) versus the original 0.82, failing to replicate the claim that induced smiling increases rated funniness. Challenges a textbook facial-feedback finding.
Critiques: Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis — Fritz Strack, Leonard L. Martin, Sabine Stepper
reproducibility · methods · statistics · claims · theory
Crossref-verified: DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.
- Tier S · Social psychology / embodied cognition · 2015
Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women
Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg et al. · Psychological Science · 2015
A larger, better-powered replication (N=200) of the power-posing study replicated only self-reported feelings of power but found no effect of expansive postures on testosterone, cortisol, or behavioral risk tolerance. Challenges the central physiological and behavioral claims of the original power-posing paper.
Critiques: Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance — Dana R. Carney, Amy J. C. Cuddy, Andy J. Yap
reproducibility · methods · statistics · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Psychological Science (2015), indexed as journal-article.
- Tier A · Psychology / metascience · 2018
Many Labs 2: Investigating Variation in Replicability Across Samples and Settings
Richard A. Klein, Michelangelo Vianello, Fred Hasselman · Advances in Methods and Practices in Psychological Science · 2018
A large preregistered multi-site project replicated 28 published effects across 60+ samples and ~15,000 participants. Only about half replicated robustly, and variation across samples and settings was generally small, implying that failures to replicate reflect the original effects themselves rather than hidden moderators. Challenges the robustness and breadth of numerous canonical findings.
Critiques: 28 classic and contemporary psychological findings (multi-target replication, e.g. Tversky & Kahneman framing, Schwarz heuristics, moral-judgment effects) — Various original authors
reproducibility · methods · statistics · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Advances in Methods and Practices in Psychological Science (2018), indexed as journal-article.
- Tier S · Strategic Management / Organization · 2022
The "CEO in Context" Technique Revisited: A Replication and Extension of Hambrick and Quigley (2014)
Tobias Keller, Martin Glaum, Andreas Bausch, Thorsten Bunz · Strategic Management Journal · 2022
Replicates and extends the 'CEO in Context' technique on a far larger sample (33,996 firm-years vs 4,866) and broadly confirms the original's high CEO effect, attributing about a third of the variance in firm performance (ROA) to the CEO. It also shows that the estimate shrinks under an adjusted-R² specification, a within-paper robustness nuance rather than an overturning of the headline finding.
methods · statistics · reproducibility · generalisation · identification
Crossref-verified: DOI resolves in Crossref to this exact title in Strategic Management Journal (2022), indexed as journal-article.
- Tier S · Political Science (public opinion / political behavior) · 1998
Macropartisanship: A Replication and Critique
Donald Green, Bradley Palmquist, Eric Schickler · American Political Science Review · 1998
Replicates MacKuen, Erikson, and Stimson's claim that aggregate party identification swings substantially in response to short-term shocks like consumer sentiment and presidential approval. Using more extensive survey data and correcting for measurement error, finds the short-term partisan movement is two to three times smaller than originally reported, supporting a stable, slow-adjusting view of partisanship.
Critiques: Macropartisanship — Michael B. MacKuen, Robert S. Erikson, James A. Stimson
statistics · methods · reproducibility · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (1998), indexed as journal-article.
- Tier A · Sociology (political sociology / welfare state) · 2015
The Missing Main Effect of Welfare State Regimes: A Replication of 'Social Policy Responsiveness in Developed Democracies' by Brooks and Manza
Nate Breznau · Sociological Science · 2015
Replicates Brooks and Manza's (2006, ASR) claim that public opinion drives welfare-state spending and finds it rests on a model specification error: they included an opinion-by-welfare-regime interaction while omitting the main effect of welfare regime; restoring the missing main effect across more than 800 model configurations eliminates the original finding in roughly 99.5% of cases.
Critiques: Social Policy Responsiveness in Developed Democracies
statistics · methods · reproducibility · data_code · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Sociological Science (2015), indexed as journal-article.
- Tier S · Political science (political methodology / causal inference) · 2014
On the Validity of the Regression Discontinuity Design for Estimating Electoral Effects: New Evidence from Over 40,000 Close Races
Andrew C. Eggers, Anthony Fowler, Jens Hainmueller, Andrew B. Hall et al. · American Journal of Political Science · 2014
Assembling over 40,000 close races across many electoral settings, it finds no systematic evidence of strategic sorting or covariate imbalance at the threshold, arguing the close-election RD design is generally valid and that the Caughey-Sekhon/Snyder imbalance is largely specific to postwar U.S. House races rather than a general flaw. It reframes the earlier critique as an unusual case rather than evidence against RD broadly.
Critiques: Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008
identification · methods · statistics · reproducibility · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Political Science (2014), indexed as journal-article.
- Tier S · Computational social science / social-media text analysis · 2021 · AI-related target
Reconsidering evidence of moral contagion in online social networks
Jason W. Burton, Nicole Cruz, Ulrike Hahn · Nature Human Behaviour · 2021
Re-tests Brady et al.'s (2017) 'moral contagion' method on six new Twitter corpora rather than reanalysing their data. Using out-of-sample prediction, model comparison and specification-curve analysis, it finds that the moral-contagion model performs no better than a deliberately implausible 'XYZ contagion' placebo, which challenges the strength of the original correlational claim while conceding that moral contagion may still exist.
Critiques: Emotion shapes the diffusion of moralized content in social networks
methods · statistics · identification · reproducibility · overclaiming · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Nature Human Behaviour (2021), indexed as journal-article.
- Tier A · New media / games studies / media effects · 2019
Game perspective-taking effects on willingness to help immigrants: A replication study with a Spanish sample
Jorge Peña, Juan Francisco Hernández Pérez · New Media & Society · 2019
A replication of a perspective-taking game study on willingness to help immigrants. The original reported reductions in behavioural intention, subjective norms and self-efficacy (attitudes were unaffected); the Spanish-sample replication reproduced the intention effect but not the subjective-norms or self-efficacy effects, while finding an attitude effect the original did not — partly corroborating and partly diverging from the original.
reproducibility · generalisation · claims · methods
Crossref-verified: DOI resolves in Crossref to this exact title in New Media & Society (2019), indexed as journal-article.
- Tier A · Public administration · 2021
A replication of "Representative bureaucracy and the willingness to coproduce"
Martin Sievert · Public Administration · 2021
A wide replication, on new data, of Riccucci, Van Ryzin and Li's (2016) survey experiment on representative bureaucracy and citizens' willingness to coproduce, testing whether the original's representation effects hold in a different national context rather than re-running the original data.
Critiques: Representative Bureaucracy and the Willingness to Coproduce: An Experimental Study — Norma M. Riccucci, Gregg G. Van Ryzin, Huafang Li
reproducibility · statistics · methods · generalisation · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Public Administration (2021), indexed as journal-article.
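The weighting point in the Herndon, Ash and Pollin entry above is easy to see with toy numbers. The sketch below assumes the choice at issue is between averaging growth within each country first and then across countries (every country counts equally, however few years it contributes) and averaging over all country-years directly. The data are hypothetical and chosen only to show how far the two summaries can diverge. They are not taken from either paper.

```python
from statistics import fmean

# Hypothetical high-debt episodes: annual real GDP growth (%) by country.
growth_by_country = {
    "Country A": [-7.0],                # a single bad year
    "Country B": [2.0, 2.5, 3.0, 2.5],  # four ordinary years
}


def country_weighted_mean(data: dict[str, list[float]]) -> float:
    """Average each country's years first, then average the country means."""
    return fmean(fmean(years) for years in data.values())


def country_year_weighted_mean(data: dict[str, list[float]]) -> float:
    """Pool every country-year observation and take a single mean."""
    return fmean(value for years in data.values() for value in years)


print(country_weighted_mean(growth_by_country))       # one vote per country
print(country_year_weighted_mean(growth_by_country))  # one vote per country-year
```

Under country weighting the single bad year carries half of the total weight, while under country-year weighting it carries one fifth. Neither scheme is automatically right, which is why a replication that reports both is informative.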
Reanalysis · 8
A re-run of the target's own data under different — often more defensible — modelling choices.
- Tier A · Criminology / algorithmic risk assessment · 2018 · AI-related target
The accuracy, fairness, and limits of predicting recidivism
Julia Dressel, Hany Farid · Science Advances · 2018
A widely cited reanalysis showing that the commercial COMPAS recidivism risk algorithm (137 features) is no more accurate or fair than predictions from untrained humans on Mechanical Turk (about 62% accuracy for the human participants vs about 65% for COMPAS), and that a simple two-feature linear classifier matches COMPAS's accuracy. It directly challenges claims that proprietary ML risk-assessment tools provide superior, sophisticated predictive power over simple baselines.
statistics · methods · claims · overclaiming · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Science Advances (2018), indexed as journal-article.
- Tier A · Political science / computational social science (ML methodology) · 2023 · AI-related target
Leakage and the reproducibility crisis in machine-learning-based science
Sayash Kapoor, Arvind Narayanan · Patterns · 2023
A reproducibility audit identifying data leakage as a pervasive failure mode across 294 ML-based-science papers in 17 fields; its central social-science case study reproduces civil-war prediction papers and shows that, after correcting for leakage, complex ML models do not outperform decades-old logistic regression, overturning published claims of ML superiority. It challenges overclaimed ML performance and proposes model info sheets as a remedy.
reproducibility · data_code · methods · statistics · overclaiming · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Patterns (2023), indexed as journal-article.
- Tier S · Political Science (experimental methods / voter mobilization) · 2005
Do Get-Out-the-Vote Calls Reduce Turnout? The Importance of Statistical Methods for Field Experiments
Kosuke Imai · American Political Science Review · 2005
Reanalyzes Gerber and Green's influential New Haven GOTV field experiment and argues the implemented treatment and control groups were not balanced as a randomized design requires; applying matching and corrected statistical methods, claims that phone calls in fact produced large positive turnout effects, contradicting the original null result and highlighting the consequences of statistical/computational choices in experiments.
identification · statistics · methods · data_code · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.
- Tier A · Sociology (family/demography) · 2015
Measurement, methods, and divergent patterns: Reassessing the effects of same-sex parents
Simon Cheng, Brian Powell · Social Science Research · 2015
Reanalyzes Regnerus's (2012) New Family Structures Study and shows his negative findings for the adult children of parents who had a same-sex relationship are fragile. At least a third to two-fifths of the 236 same-sex-parent cases are misclassified, and the results further hinge on contested measurement and coding choices (outcome recoding, the comparison category, sociodemographic controls, multiple imputation). Correcting the misclassification and these choices renders most of the associations statistically insignificant.
data_code · methods · statistics · reproducibility · claims · overclaiming
Crossref-verified: DOI resolves in Crossref to this exact title in Social Science Research (2015), indexed as journal-article.
- Tier S · Sociology (neighborhood effects / urban poverty) · 2008
Neighborhood Effects on Economic Self-Sufficiency: A Reconsideration of the Moving to Opportunity Experiment
Susan Clampet-Lundquist, Douglas S. Massey · American Journal of Sociology · 2008
Reconsiders the influential Moving to Opportunity housing-voucher experiment's conclusion of null neighborhood effects on adult economic self-sufficiency, arguing that the intention-to-treat design and treatment definition mask real effects; using duration and quality of neighborhood exposure, the authors find evidence that sustained exposure to lower-poverty neighborhoods does improve economic outcomes.
identification · methods · claims · statistics · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Sociology (2008), indexed as journal-article.
- Tier A · Political science (political methodology / causal inference) · 2011
Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008
Devin Caughey, Jasjeet S. Sekhon · Political Analysis · 2011
Replicating close U.S. House races, it shows bare winners and bare losers differ markedly on pretreatment covariates (financial, experience, and incumbency advantages), undermining the as-if-random assumption underpinning Lee-style regression discontinuity designs for elections. It attributes the imbalance to sorting via activities on or before Election Day rather than post-election manipulation.
Critiques: Randomized Experiments from Non-random Selection in U.S. House Elections
identification · methods · statistics · reproducibility · generalisation
Crossref-verified: DOI resolves in Crossref to this exact title in Political Analysis (2011), indexed as journal-article.
- Tier S · Political science (voting behavior / retrospective voting) · 2018
Do Shark Attacks Influence Presidential Elections? Reassessing a Prominent Finding on Voter Competence
Anthony Fowler, Andrew B. Hall · The Journal of Politics · 2018
Reanalyzing Achen and Bartels's claim that 1916 New Jersey shark attacks cost Woodrow Wilson roughly ten points in beach communities, it finds the county-level effect shrinks and weakens under alternative specifications and the town-level Ocean County result largely vanishes once coding errors are corrected. It concludes there is little compelling evidence that shark attacks influenced the election, casting doubt on this prominent 'blind retrospection' demonstration of voter incompetence.
statistics · data_code · claims · overclaiming · reproducibility
Crossref-verified: DOI resolves in Crossref to this exact title in The Journal of Politics (2018), indexed as journal-article.
- Tier S · Finance (asset pricing / mutual fund performance) · 2019
Reassessing False Discoveries in Mutual Fund Performance: Skill, Luck, or Lack of Power?
Angie Andrikogiannopoulou, Filippos Papakonstantinou · The Journal of Finance · 2019
Reanalyzes the false-discovery-rate (FDR) method that Barras, Scaillet, and Wermers (2010) use to separate skilled, zero-alpha, and unskilled mutual funds. Andrikogiannopoulou and Papakonstantinou show via simulation that the FDR estimator is severely biased and underpowered at empirically relevant sample sizes, drastically overstating the fraction of zero-alpha funds and understating the proportion of skilled and unskilled funds.
Critiques: False Discoveries in Mutual Fund Performance: Measuring Luck in Estimated Alphas — Laurent Barras, Olivier Scaillet, Russ Wermers
statistics · methods · reproducibility · claims
Crossref-verified: DOI resolves in Crossref to this exact title in The Journal of Finance (2019), indexed as journal-article.
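Several reanalyses in this group (Imai on the New Haven experiment, Caughey and Sekhon on close House races) turn on whether two groups that should look alike before treatment actually do. One common diagnostic is the standardized mean difference on a pretreatment covariate. The sketch below is a generic version of that diagnostic, not the procedure either paper used, and the indicator data are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def standardized_mean_difference(group_a: list[float], group_b: list[float]) -> float:
    """Difference in means divided by the pooled standard deviation of the two groups."""
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate has no variation in either group")
    return (fmean(group_a) - fmean(group_b)) / pooled_sd


# Hypothetical pretreatment indicator (1 = candidate's party already held the seat).
bare_winners = [1, 1, 1, 0, 1, 1, 0, 1]
bare_losers = [0, 1, 0, 0, 1, 0, 0, 0]

smd = standardized_mean_difference(bare_winners, bare_losers)
print(f"standardized mean difference: {smd:.2f}")
```

Under as-if-random assignment the difference should sit close to zero. Analysts often flag absolute values above roughly 0.1 to 0.25, though the cutoff is a convention rather than a test.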
Critical commentary · 5
A critical commentary that challenges the target's framing, inference, or generalisation without a formal Comment slot.
- Tier S · Sociology / computational social science · 2020 · AI-related target
What failure to predict life outcomes can teach us
Filiz Garip · Proceedings of the National Academy of Sciences · 2020
An invited PNAS commentary on Salganik et al.'s Fragile Families Challenge, arguing that the mass-collaboration finding that machine-learning models barely beat a simple benchmark exposes real limits of predictive ML in social science, and that the value lies in the common-task framework and out-of-sample testing rather than in any individual model's accuracy. It reframes the celebrated ML exercise as evidence of how little predictive purchase rich data plus ML actually buys for individual life outcomes.
Critiques: Measuring the predictability of life outcomes with a scientific mass collaboration
methodsclaimsoverclaiminggeneralisationreproducibilityCrossref-verified: DOI resolves in Crossref to this exact title in Proceedings of the National Academy of Sciences (2020), indexed as journal-article.
- Tier S · Sociology (organizational ecology) · 1991
Density Dependence in Organizational Mortality: Legitimacy or Unobserved Heterogeneity?
Trond Petersen, Kenneth W. Koput · American Sociological Review · 1991
Challenges the standard interpretation of density-dependence tests in organizational ecology, arguing that the observed negative first-order effect of organizational density on mortality rates is as consistent with unobserved heterogeneity (selection) as with the theorized legitimation process, which undermines the causal-theoretical reading of Hannan and Carroll's models.
Critiques: Density Dependence in the Evolution of Populations of Newspaper Organizations — Glenn R. Carroll, Michael T. Hannan
statistics · identification · methods · theory · claims
Crossref-verified: DOI resolves in Crossref to this exact title in American Sociological Review (1991), indexed as journal-article.
- Tier S · Political science (political methodology / survey experiments) · 2022
What Do We Learn about Voter Preferences from Conjoint Experiments?
Scott F. Abramson, Korhan Kocak, Asya Magazinnik · American Journal of Political Science · 2022
Shows that the average marginal component effect (AMCE), the central estimand of conjoint experiments popularized by Hainmueller, Hopkins, and Yamamoto, is not well defined in terms of majority preferences: even with rational subjects, a positive AMCE can point opposite to the true majority preference, so AMCEs do not license common claims about what voters prefer. Argues that the estimand conflates the direction and the intensity of preferences across respondents.
theory · methods · claims · overclaiming · statistics
Crossref-verified: DOI resolves in Crossref to this exact title in American Journal of Political Science (2022), indexed as journal-article.
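The direction-versus-intensity point in this entry can be made concrete with a small constructed example. The sketch below is not taken from the paper: it assumes a forced-choice conjoint with two binary attributes `a` and `b`, fully randomized profile pairs, coin-flip tie-breaking, and two hypothetical voter types with lexicographic preferences. It computes the AMCE of `a` exactly by enumerating all profile pairs.

```python
from fractions import Fraction
from itertools import product

# Every profile is a pair (a, b) of binary attribute levels.
PROFILES = list(product([0, 1], repeat=2))


def choice_prob(u_own, u_other):
    """Probability the first profile is chosen; ties are a coin flip."""
    if u_own > u_other:
        return Fraction(1)
    if u_own < u_other:
        return Fraction(0)
    return Fraction(1, 2)


def amce_of_a(utility):
    """AMCE of attribute a for one voter type under uniform randomization."""
    def mean_chosen(a_level):
        own = [p for p in PROFILES if p[0] == a_level]
        probs = [choice_prob(utility(p), utility(q))
                 for p in own for q in PROFILES]
        return sum(probs) / len(probs)
    return mean_chosen(1) - mean_chosen(0)


def majority_type(profile):
    """Hypothetical 60% of voters: b decides first; a = 0 only breaks ties."""
    a, b = profile
    return 2 * b + (1 - a)


def minority_type(profile):
    """Hypothetical 40% of voters: a = 1 decides first; b only breaks ties."""
    a, b = profile
    return 2 * a + b


shares = {majority_type: Fraction(3, 5), minority_type: Fraction(2, 5)}

# Population AMCE is the share-weighted average of the type-level AMCEs.
amce = sum(share * amce_of_a(u) for u, share in shares.items())

# Share of voters who prefer a = 1 to a = 0 with b held fixed.
share_preferring_a1 = sum(
    share for u, share in shares.items() if u((1, 0)) > u((0, 0))
)

print(amce)                 # 1/20
print(share_preferring_a1)  # 2/5
```

The minority cares intensely about `a`, so its type-level AMCE (1/2) outweighs the majority's (-1/4) even at a 40% share. The pooled AMCE is positive although 60% of voters prefer `a = 0`, all else equal.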
- Tier A · Education / Educational Psychology · 2018
What Shall We Do About Grit? A Critical Review of What We Know and What We Don't Know
Marcus Credé · Educational Researcher · 2018
Critically reviews the grit literature popularized by Angela Duckworth, arguing the empirical evidence does not justify combining passion and perseverance into a single construct, that grit predicts academic performance only weakly (and no better than conscientiousness, a jangle-fallacy concern), and that there is no evidence grit interventions work.
Critiques: Grit: Perseverance and Passion for Long-Term Goals (Duckworth, Peterson, Matthews, & Kelly)
statistics · claims · overclaiming · theory · novelty
Crossref-verified: DOI resolves in Crossref to this exact title in Educational Researcher (2018), indexed as journal-article.
- Tier S · Criminology · 2010
Poverty, Infant Mortality, and Homicide Rates in Cross-National Perspective: Assessments of Criterion and Construct Validity
Steven F. Messner, Lawrence E. Raffalovich, Gretchen M. Sutton · Criminology · 2010
Responds to Pridemore's critique of using infant mortality as a proxy for poverty in cross-national homicide research. Rather than re-running his analysis, the authors assemble a new 16-nation panel (1993-2000) with direct income-based poverty measures and find that infant mortality correlates more strongly with relative than with absolute poverty. They argue that disadvantage is best treated as a multidimensional construct, which makes this a qualified, collaborative response rather than a wholesale rejection of the proxy.
Critiques: A Methodological Addition to the Cross-National Empirical Literature on Social Structure and Homicide: A First Test of the Poverty-Homicide Thesis — William Alex Pridemore
statistics · methods · claims
Crossref-verified: DOI resolves in Crossref to this exact title in Criminology (2010), indexed as journal-article.
How these calibrate Critical AI
Each benchmark is tagged with the critique dimensions it exercises: identification, statistics, reproducibility, overclaiming, generalisation, and so on. Those are the same dimensions Critical AI’s own critique pipeline works through. A published Comment that overturns a headline result by re-running its regressions is the concrete standard our critiques are measured against: specific, sourced, and falsifiable, never a verdict on motive. The corpus skews toward the most-cited exemplars of the genre so the bar is set high.
Every DOI here resolved through Crossref at ingestion. The benchmark set grows as more verified exemplars are added; the machine-readable list is at /critique/api/benchmarks.
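A check of the kind described by the "Crossref-verified" lines might look like the sketch below. It is an assumption about how such a check could be written, not the ingestion code behind this page. It relies on the public Crossref REST API (`https://api.crossref.org/works/{doi}`), whose `message` object carries `title`, `container-title`, `type`, and `issued` fields. The function names and the hand-written sample record are illustrative.

```python
import json
import urllib.parse
import urllib.request


def normalize(text):
    """Case-fold and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(text.casefold().split())


def crossref_matches(message, title, journal, year, work_type="journal-article"):
    """Return True if a Crossref `message` agrees with the expected metadata."""
    titles = [normalize(t) for t in message.get("title", [])]
    journals = [normalize(j) for j in message.get("container-title", [])]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    issued_year = date_parts[0][0] if date_parts and date_parts[0] else None
    return (
        normalize(title) in titles
        and normalize(journal) in journals
        and issued_year == year
        and message.get("type") == work_type
    )


def fetch_crossref_message(doi):
    """Fetch the metadata record for a DOI from the Crossref REST API."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["message"]


# Hand-written record shaped like a Crossref `message`, for offline use.
# It is not a captured API response.
sample = {
    "title": ["What failure to predict life outcomes can teach us"],
    "container-title": ["Proceedings of the National Academy of Sciences"],
    "type": "journal-article",
    "issued": {"date-parts": [[2020]]},
}

ok = crossref_matches(
    sample,
    title="What failure to predict life outcomes can teach us",
    journal="Proceedings of the National Academy of Sciences",
    year=2020,
)
print(ok)  # True
```

Comparing the title, journal, year, and record type together, rather than only confirming that the DOI resolves, is what guards against a real DOI attached to the wrong paper. The year comparison is the most fragile part, since a record's issued date can differ from the print year.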