{"$schema":"https://policywindow.org/critique/api/schema","name":"Critical AI — critique benchmarks","description":"Real, published human-expert critiques (Comments, Replies, replications, critical commentaries, reanalyses) in top-tier social-science journals. Every DOI is independently Crossref-verified. Used as calibration benchmarks for Critical AI's AI-native critiques.","docs":"https://policywindow.org/critique/benchmarks","coverage":{"total":68,"aiRelated":21,"venues":48,"fields":61,"dimensions":10},"count":68,"benchmarks":[{"id":"herndon-ash-pollin-reinhart-rogoff","title":"Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff","authors":["Thomas Herndon","Michael Ash","Robert Pollin"],"venue":"Cambridge Journal of Economics","tier":"A","year":"2013","doi":"10.1093/cje/bet075","critiqueType":"replication","target":{"title":"Growth in a Time of Debt","authors":["Carmen M. Reinhart","Kenneth S. Rogoff"],"doi":"10.1257/aer.100.2.573","year":""},"whatItChallenges":"Replicating Reinhart and Rogoff's claim that public debt above 90% of GDP is associated with sharply lower growth, the authors find a spreadsheet coding error, selective exclusion of available country-year data, and unconventional weighting. Corrected, average real growth for high-debt countries is +2.2%, not the published -0.1%, eliminating the supposed debt threshold.","dimensions":["methods","data_code","reproducibility","statistics","claims","overclaiming"],"aiRelated":false,"field":"Macroeconomics / Public Finance","verifyNote":"DOI resolves in Crossref to this exact title in Cambridge Journal of Economics (2013), indexed as journal-article.","doi_url":"https://doi.org/10.1093/cje/bet075","target_doi_url":"https://doi.org/10.1257/aer.100.2.573"},{"id":"foote-goetz-abortion-crime","title":"The Impact of Legalized Abortion on Crime: Comment","authors":["Christopher L. Foote","Christopher F. 
Goetz"],"venue":"Quarterly Journal of Economics","tier":"S","year":"2008","doi":"10.1162/qjec.2008.123.1.407","critiqueType":"comment","target":{"title":"The Impact of Legalized Abortion on Crime","authors":["John J. Donohue III","Steven D. Levitt"],"doi":"10.1162/00335530151144050","year":""},"whatItChallenges":"The comment identifies a coding mistake in the within-state cohort regressions of Donohue and Levitt's abortion-crime paper and shows that correcting it and using a per-capita crime specification sharply weakens the results. It also shows the cross-state tests are not robust to allowing differential state trends.","dimensions":["methods","identification","data_code","statistics","claims","reproducibility"],"aiRelated":false,"field":"Applied Microeconomics / Crime","verifyNote":"DOI resolves in Crossref to this exact title in Quarterly Journal of Economics (2008), indexed as journal-article.","doi_url":"https://doi.org/10.1162/qjec.2008.123.1.407","target_doi_url":"https://doi.org/10.1162/00335530151144050"},{"id":"albouy-colonial-origins","title":"The Colonial Origins of Comparative Development: An Empirical Investigation: Comment","authors":["David Y. Albouy"],"venue":"American Economic Review","tier":"S","year":"2012","doi":"10.1257/aer.102.6.3059","critiqueType":"comment","target":{"title":"The Colonial Origins of Comparative Development: An Empirical Investigation","authors":["Daron Acemoglu","Simon Johnson","James A. Robinson"],"doi":"10.1257/aer.91.5.1369","year":""},"whatItChallenges":"Albouy shows that 36 of the 64 countries are assigned settler-mortality rates borrowed from other countries and that incomparable rates from laborers, bishops, and soldiers on campaign are combined in ways favorable to the institutions hypothesis. 
Once data problems are addressed, the mortality-expropriation relationship and the instrumental-variable estimates lose robustness, often yielding effectively infinite confidence intervals.","dimensions":["data_code","identification","methods","statistics","claims","generalisation"],"aiRelated":false,"field":"Development Economics / Institutions","verifyNote":"DOI resolves in Crossref to this exact title in American Economic Review (2012), indexed as journal-article.","doi_url":"https://doi.org/10.1257/aer.102.6.3059","target_doi_url":"https://doi.org/10.1257/aer.91.5.1369"},{"id":"gilbert-king-reproducibility-psychology","title":"Comment on \"Estimating the reproducibility of psychological science\"","authors":["Daniel T. Gilbert","Gary King","Stephen Pettigrew","Timothy D. Wilson"],"venue":"Science","tier":"S","year":"2016","doi":"10.1126/science.aad7243","critiqueType":"comment","target":{"title":"Estimating the reproducibility of psychological science","authors":["Open Science Collaboration"],"doi":"10.1126/science.aac4716","year":""},"whatItChallenges":"Argues the Open Science Collaboration's Reproducibility Project contains three statistical errors (low-power replications, non-representative study sampling, and misleading endorsement criteria) that bias the reported replication rate downward. Concludes the data are actually consistent with very high reproducibility, not the low rate the original claimed.","dimensions":["statistics","methods","identification","claims","reproducibility","generalisation"],"aiRelated":false,"field":"Psychology / metascience","verifyNote":"DOI resolves in Crossref to this exact title in Science (2016), indexed as journal-article.","doi_url":"https://doi.org/10.1126/science.aad7243","target_doi_url":"https://doi.org/10.1126/science.aac4716"},{"id":"hagger-ego-depletion-rrr","title":"A Multilab Preregistered Replication of the Ego-Depletion Effect","authors":["Martin S. Hagger","Nikos L. D. 
Chatzisarantis","Hugo Alberts","Calvin Octavianus Anggono"],"venue":"Perspectives on Psychological Science","tier":"A","year":"2016","doi":"10.1177/1745691616652873","critiqueType":"replication","target":{"title":"Methylphenidate Blocks Effort-Induced Depletion of Regulatory Control in Healthy Volunteers","authors":["Chandra Sripada","Daniel Kessler","John Jonides"],"doi":"10.1177/0956797614526415","year":""},"whatItChallenges":"A preregistered Registered Replication Report across 23 labs (~2000 participants) tested the sequential-task ego-depletion effect and found a meta-analytic effect indistinguishable from zero (d ≈ 0.04). Challenges the existence and robustness of the widely cited ego-depletion / limited-resource model of self-control.","dimensions":["reproducibility","methods","statistics","claims","generalisation"],"aiRelated":false,"field":"Social/cognitive psychology (self-control)","verifyNote":"DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.","doi_url":"https://doi.org/10.1177/1745691616652873","target_doi_url":"https://doi.org/10.1177/0956797614526415"},{"id":"wagenmakers-facial-feedback-rrr","title":"Registered Replication Report: Strack, Martin, & Stepper (1988)","authors":["E.-J. Wagenmakers","Titia Beek","Laura Dijkhoff","Quentin F. Gronau"],"venue":"Perspectives on Psychological Science","tier":"A","year":"2016","doi":"10.1177/1745691616674458","critiqueType":"replication","target":{"title":"Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis","authors":["Fritz Strack","Leonard L. 
Martin","Sabine Stepper"],"doi":"10.1037/0022-3514.54.5.768","year":""},"whatItChallenges":"A Registered Replication Report of 17 direct replications of the classic pen-in-mouth facial-feedback study found a pooled effect of 0.03 rating units (95% CI -0.11 to 0.16) versus the original 0.82, failing to replicate the claim that induced smiling increases rated funniness. Challenges a textbook facial-feedback finding.","dimensions":["reproducibility","methods","statistics","claims","theory"],"aiRelated":false,"field":"Social psychology / emotion","verifyNote":"DOI resolves in Crossref to this exact title in Perspectives on Psychological Science (2016), indexed as journal-article.","doi_url":"https://doi.org/10.1177/1745691616674458","target_doi_url":"https://doi.org/10.1037/0022-3514.54.5.768"},{"id":"ranehill-power-posing","title":"Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women","authors":["Eva Ranehill","Anna Dreber","Magnus Johannesson","Susanne Leiberg","Sunhae Sul","Roberto A. Weber"],"venue":"Psychological Science","tier":"S","year":"2015","doi":"10.1177/0956797614553946","critiqueType":"replication","target":{"title":"Power Posing: Brief Nonverbal Displays Affect Neuroendocrine Levels and Risk Tolerance","authors":["Dana R. Carney","Amy J. C. Cuddy","Andy J. Yap"],"doi":"10.1177/0956797610383437","year":""},"whatItChallenges":"A larger, better-powered replication (N=200) of the power-posing study replicated only self-reported feelings of power but found no effect of expansive postures on testosterone, cortisol, or behavioral risk tolerance. 
Challenges the central physiological and behavioral claims of the original power-posing paper.","dimensions":["reproducibility","methods","statistics","claims","overclaiming"],"aiRelated":false,"field":"Social psychology / embodied cognition","verifyNote":"DOI resolves in Crossref to this exact title in Psychological Science (2015), indexed as journal-article.","doi_url":"https://doi.org/10.1177/0956797614553946","target_doi_url":"https://doi.org/10.1177/0956797610383437"},{"id":"klein-many-labs-2","title":"Many Labs 2: Investigating Variation in Replicability Across Samples and Settings","authors":["Richard A. Klein","Michelangelo Vianello","Fred Hasselman"],"venue":"Advances in Methods and Practices in Psychological Science","tier":"A","year":"2018","doi":"10.1177/2515245918810225","critiqueType":"replication","target":{"title":"28 classic and contemporary psychological findings (multi-target replication, e.g. Tversky & Kahneman framing, Schwarz heuristics, moral-judgment effects)","authors":["Various original authors"],"doi":null,"year":""},"whatItChallenges":"A large preregistered multi-site project replicated 28 published effects across 60+ samples and ~15,000 participants; only about half replicated robustly and variation across samples/settings was generally small, implying non-replication reflects original effects rather than hidden moderators. Challenges the robustness and breadth of numerous canonical findings.","dimensions":["reproducibility","methods","statistics","generalisation","claims"],"aiRelated":false,"field":"Psychology / metascience","verifyNote":"DOI resolves in Crossref to this exact title in Advances in Methods and Practices in Psychological Science (2018), indexed as journal-article.","doi_url":"https://doi.org/10.1177/2515245918810225","target_doi_url":null},{"id":"quigley-hambrick-ceo-effect-comment","title":"Reaffirming the CEO Effect Is Significant and Much Larger than Chance: A Comment on Fitza (2014)","authors":["Timothy J. 
Quigley","Scott D. Graffin"],"venue":"Strategic Management Journal","tier":"S","year":"2016","doi":"10.1002/smj.2503","critiqueType":"comment","target":{"title":"The use of variance decomposition in the investigation of CEO effects: How large must the CEO effect be to rule out chance?","authors":["Markus A. Fitza"],"doi":"10.1002/smj.2192","year":""},"whatItChallenges":"Challenges Fitza's (2014) claim that the estimated 'CEO effect' on firm performance is almost entirely an artifact of random chance, arguing his simulation/variance-decomposition approach mis-specifies the chance baseline. Using corrected methods, they conclude the CEO effect is statistically significant and substantively much larger than chance.","dimensions":["methods","identification","statistics","reproducibility"],"aiRelated":false,"field":"Strategic Management / Organization","verifyNote":"DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.","doi_url":"https://doi.org/10.1002/smj.2503","target_doi_url":"https://doi.org/10.1002/smj.2192"},{"id":"fitza-ceo-effect-rejoinder","title":"How Much Do CEOs Really Matter? Reaffirming That the CEO Effect Is Mostly Due to Chance","authors":["Markus A. 
Fitza"],"venue":"Strategic Management Journal","tier":"S","year":"2016","doi":"10.1002/smj.2597","critiqueType":"rejoinder","target":{"title":"Reaffirming the CEO Effect Is Significant and Much Larger than Chance: A Comment on Fitza (2014) (Quigley & Graffin 2017)","authors":[],"doi":"10.1002/smj.2503","year":""},"whatItChallenges":"Rejoinder defending the original conclusion against Quigley and Graffin's comment, arguing that once more realistic assumptions about how chance affects firm performance are imposed, the apparent CEO effect is statistically indistinguishable from chance regardless of the estimation methodology used.","dimensions":["methods","identification","statistics","claims"],"aiRelated":false,"field":"Strategic Management / Organization","verifyNote":"DOI resolves in Crossref to this exact title in Strategic Management Journal (2016), indexed as journal-article.","doi_url":"https://doi.org/10.1002/smj.2597","target_doi_url":"https://doi.org/10.1002/smj.2503"},{"id":"quigley-ceo-in-context-replication","title":"The \"CEO in Context\" Technique Revisited: A Replication and Extension of Hambrick and Quigley (2014)","authors":["Tobias Keller","Martin Glaum","Andreas Bausch","Thorsten Bunz"],"venue":"Strategic Management Journal","tier":"S","year":"2022","doi":"10.1002/smj.3453","critiqueType":"replication","target":{"title":"Toward More Accurate Contextualization of the CEO Effect on Firm Performance (Hambrick & Quigley 2014)","authors":[],"doi":"10.1002/smj.2108","year":""},"whatItChallenges":"Replicates and extends the 'CEO in Context' technique on a far larger sample (33,996 firm-years vs 4,866) and broadly CONFIRMS the original's high CEO effect — attributing about a third of the variance in firm performance (ROA) to the CEO — while showing the estimate shrinks under an adjusted-R² specification, a within-paper robustness nuance rather than an overturning of the headline 
finding.","dimensions":["methods","statistics","reproducibility","generalisation","identification"],"aiRelated":false,"field":"Strategic Management / Organization","verifyNote":"DOI resolves in Crossref to this exact title in Strategic Management Journal (2022), indexed as journal-article.","doi_url":"https://doi.org/10.1002/smj.3453","target_doi_url":"https://doi.org/10.1002/smj.2108"},{"id":"garip-fragile-families-prediction","title":"What failure to predict life outcomes can teach us","authors":["Filiz Garip"],"venue":"Proceedings of the National Academy of Sciences","tier":"S","year":"2020","doi":"10.1073/pnas.2003390117","critiqueType":"critical_commentary","target":{"title":"Measuring the predictability of life outcomes with a scientific mass collaboration","authors":[],"doi":"10.1073/pnas.1915006117","year":""},"whatItChallenges":"An invited PNAS commentary on Salganik et al.'s Fragile Families Challenge, arguing that the mass-collaboration finding that machine-learning models barely beat a simple benchmark exposes real limits of predictive ML in social science, and that the value lies in the common-task framework and out-of-sample testing rather than in any individual model's accuracy. 
It reframes the celebrated ML exercise as evidence of how little predictive purchase rich data plus ML actually buys for individual life outcomes.","dimensions":["methods","claims","overclaiming","generalisation","reproducibility"],"aiRelated":true,"field":"Sociology / computational social science","verifyNote":"DOI resolves in Crossref to this exact title in Proceedings of the National Academy of Sciences (2020), indexed as journal-article.","doi_url":"https://doi.org/10.1073/pnas.2003390117","target_doi_url":"https://doi.org/10.1073/pnas.1915006117"},{"id":"dressel-farid-compas","title":"The accuracy, fairness, and limits of predicting recidivism","authors":["Julia Dressel","Hany Farid"],"venue":"Science Advances","tier":"A","year":"2018","doi":"10.1126/sciadv.aao5580","critiqueType":"reanalysis","target":{"title":"Evaluating the predictive validity of the COMPAS Risk and Needs Assessment System (Northpointe/COMPAS recidivism risk tool)","authors":[],"doi":"10.1177/0093854808326545","year":""},"whatItChallenges":"A widely cited reanalysis showing that the commercial COMPAS recidivism risk algorithm (137 features) is no more accurate or fair than predictions from untrained humans on Mechanical Turk (about 62% accuracy for individual participants vs about 65% for COMPAS), and that a simple two-feature linear classifier matches COMPAS's accuracy. 
It directly challenges claims that proprietary ML risk-assessment tools provide superior, sophisticated predictive power over simple baselines.","dimensions":["statistics","methods","claims","overclaiming","novelty"],"aiRelated":true,"field":"Criminology / algorithmic risk assessment","verifyNote":"DOI resolves in Crossref to this exact title in Science Advances (2018), indexed as journal-article.","doi_url":"https://doi.org/10.1126/sciadv.aao5580","target_doi_url":"https://doi.org/10.1177/0093854808326545"},{"id":"kapoor-narayanan-ml-leakage","title":"Leakage and the reproducibility crisis in machine-learning-based science","authors":["Sayash Kapoor","Arvind Narayanan"],"venue":"Patterns","tier":"A","year":"2023","doi":"10.1016/j.patter.2023.100804","critiqueType":"reanalysis","target":{"title":"ML-based civil-war / armed-conflict prediction studies claiming complex ML outperforms logistic regression (e.g., Colaresi & Mahmood and the systematic review of conflict-forecasting papers)","authors":[],"doi":null,"year":""},"whatItChallenges":"A reproducibility audit identifying data leakage as a pervasive failure mode across 294 ML-based-science papers in 17 fields; its central social-science case study reproduces civil-war prediction papers and shows that, after correcting for leakage, complex ML models do not outperform decades-old logistic regression, overturning published claims of ML superiority. 
It challenges overclaimed ML performance and proposes model info sheets as a remedy.","dimensions":["reproducibility","data_code","methods","statistics","overclaiming","novelty"],"aiRelated":true,"field":"Political science / computational social science (ML methodology)","verifyNote":"DOI resolves in Crossref to this exact title in Patterns (2023), indexed as journal-article.","doi_url":"https://doi.org/10.1016/j.patter.2023.100804","target_doi_url":null},{"id":"green-palmquist-schickler-macropartisanship","title":"Macropartisanship: A Replication and Critique","authors":["Donald Green","Bradley Palmquist","Eric Schickler"],"venue":"American Political Science Review","tier":"S","year":"1998","doi":"10.2307/2586310","critiqueType":"replication","target":{"title":"Macropartisanship","authors":["Michael B. MacKuen","Robert S. Erikson","James A. Stimson"],"doi":"10.2307/1961661","year":""},"whatItChallenges":"Replicates MacKuen, Erikson, and Stimson's claim that aggregate party identification swings substantially in response to short-term shocks like consumer sentiment and presidential approval. Using more extensive survey data and correcting for measurement error, finds the short-term partisan movement is two to three times smaller than originally reported, supporting a stable, slow-adjusting view of partisanship.","dimensions":["statistics","methods","reproducibility","claims","overclaiming"],"aiRelated":false,"field":"Political Science (public opinion / political behavior)","verifyNote":"DOI resolves in Crossref to this exact title in American Political Science Review (1998), indexed as journal-article.","doi_url":"https://doi.org/10.2307/2586310","target_doi_url":"https://doi.org/10.2307/1961661"},{"id":"imai-gotv-turnout","title":"Do Get-Out-the-Vote Calls Reduce Turnout? 
The Importance of Statistical Methods for Field Experiments","authors":["Kosuke Imai"],"venue":"American Political Science Review","tier":"S","year":"2005","doi":"10.1017/s0003055405051658","critiqueType":"reanalysis","target":{"title":"The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment (Gerber & Green, 2000)","authors":[],"doi":"10.2307/2585837","year":""},"whatItChallenges":"Reanalyzes Gerber and Green's influential New Haven GOTV field experiment and argues the implemented treatment and control groups were not balanced as a randomized design requires; applying matching and corrected statistical methods, claims that phone calls in fact produced large positive turnout effects, contradicting the original null result and highlighting the consequences of statistical/computational choices in experiments.","dimensions":["identification","statistics","methods","data_code","reproducibility","claims"],"aiRelated":false,"field":"Political Science (experimental methods / voter mobilization)","verifyNote":"DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.","doi_url":"https://doi.org/10.1017/s0003055405051658","target_doi_url":"https://doi.org/10.2307/2585837"},{"id":"gerber-green-gotv-rejoinder","title":"Correction to Gerber and Green (2000), Replication of Disputed Findings, and Reply to Imai (2005)","authors":["Alan S. Gerber","Donald P. Green"],"venue":"American Political Science Review","tier":"S","year":"2005","doi":"10.1017/s000305540505166x","critiqueType":"rejoinder","target":{"title":"Do Get-Out-the-Vote Calls Reduce Turnout? 
The Importance of Statistical Methods for Field Experiments (Imai, 2005)","authors":[],"doi":"10.1017/s0003055405051658","year":""},"whatItChallenges":"Responds to Imai's (2005) reanalysis: acknowledges and repairs data-processing errors in the original 2000 article, then argues Imai's correction itself contains statistical, computational, and reporting errors that invalidate its conclusions. After fixes, the original substantive finding stands that brief phone calls do not meaningfully increase voter turnout.","dimensions":["statistics","methods","data_code","reproducibility","claims"],"aiRelated":false,"field":"Political Science (experimental methods / voter mobilization)","verifyNote":"DOI resolves in Crossref to this exact title in American Political Science Review (2005), indexed as journal-article.","doi_url":"https://doi.org/10.1017/s000305540505166x","target_doi_url":"https://doi.org/10.1017/s0003055405051658"},{"id":"petersen-koput-density-dependence","title":"Density Dependence in Organizational Mortality: Legitimacy or Unobserved Heterogeneity?","authors":["Trond Petersen","Kenneth W. Koput"],"venue":"American Sociological Review","tier":"S","year":"1991","doi":"10.2307/2096112","critiqueType":"critical_commentary","target":{"title":"Density Dependence in the Evolution of Populations of Newspaper Organizations","authors":["Glenn R. Carroll","Michael T. 
Hannan"],"doi":"10.2307/2095875","year":""},"whatItChallenges":"Challenges the standard interpretation of density-dependence tests in organizational ecology, arguing the observed negative first-order effect of organizational density on mortality rates is as consistent with unobserved heterogeneity (selection) as with the theorized legitimation process, undermining the causal-theoretical reading of Hannan and Carroll's models.","dimensions":["statistics","identification","methods","theory","claims"],"aiRelated":false,"field":"Sociology (organizational ecology)","verifyNote":"DOI resolves in Crossref to this exact title in American Sociological Review (1991), indexed as journal-article.","doi_url":"https://doi.org/10.2307/2096112","target_doi_url":"https://doi.org/10.2307/2095875"},{"id":"cheng-measurement-methods-divergent","title":"Measurement, methods, and divergent patterns: Reassessing the effects of same-sex parents","authors":["Simon Cheng","Brian Powell"],"venue":"Social Science Research","tier":"A","year":"2015","doi":"10.1016/j.ssresearch.2015.04.005","critiqueType":"reanalysis","target":{"title":"How different are the adult children of parents who have same-sex relationships? Findings from the New Family Structures Study","authors":[],"doi":"10.1016/j.ssresearch.2012.03.009","year":""},"whatItChallenges":"Reanalyzes Regnerus's (2012) New Family Structures Study and shows his negative findings for the adult children of parents who had a same-sex relationship are fragile. At least a third to two-fifths of the 236 same-sex-parent cases are misclassified, and the results further hinge on contested measurement and coding choices (outcome recoding, the comparison category, sociodemographic controls, multiple imputation). 
Correcting the misclassification and these choices renders most of the associations statistically insignificant.","dimensions":["data_code","methods","statistics","reproducibility","claims","overclaiming"],"aiRelated":false,"field":"Sociology (family/demography)","verifyNote":"DOI resolves in Crossref to this exact title in Social Science Research (2015), indexed as journal-article.","doi_url":"https://doi.org/10.1016/j.ssresearch.2015.04.005","target_doi_url":"https://doi.org/10.1016/j.ssresearch.2012.03.009"},{"id":"alba-commentary-kids-mostly","title":"Commentary: The Kids Are (Mostly) Alright: Second-Generation Assimilation: Comments on Haller, Portes and Lynch","authors":["Richard Alba","Philip Kasinitz","Mary C. Waters"],"venue":"Social Forces","tier":"A","year":"2011","doi":"10.1093/sf/89.3.763","critiqueType":"comment","target":{"title":"Dreams Fulfilled, Dreams Shattered: Determinants of Segmented Assimilation in the Second Generation","authors":[],"doi":"10.1353/sof.2011.0003","year":""},"whatItChallenges":"Challenges Haller, Portes and Lynch's pessimistic 'segmented assimilation / downward assimilation' thesis about the immigrant second generation, arguing that their data, model specification and interpretation overstate the prevalence and inevitability of downward mobility, and that the bulk of second-generation outcomes are in fact reasonably positive.","dimensions":["claims","theory","methods","generalisation","overclaiming"],"aiRelated":false,"field":"Sociology (immigration/assimilation)","verifyNote":"DOI resolves in Crossref to this exact title in Social Forces (2011), indexed as journal-article.","doi_url":"https://doi.org/10.1093/sf/89.3.763","target_doi_url":"https://doi.org/10.1353/sof.2011.0003"},{"id":"clampet-lundquist-neighborhood-effects-economic","title":"Neighborhood Effects on Economic Self-Sufficiency: A Reconsideration of the Moving to Opportunity Experiment","authors":["Susan Clampet-Lundquist","Douglas S. 
Massey"],"venue":"American Journal of Sociology","tier":"S","year":"2008","doi":"10.1086/588740","critiqueType":"reanalysis","target":{"title":"Moving to Opportunity / experimental analyses concluding null neighborhood effects on economic self-sufficiency (Kling, Liebman, and Katz; MTO interim impacts evaluation)","authors":[],"doi":null,"year":""},"whatItChallenges":"Reconsiders the influential Moving to Opportunity housing-voucher experiment's conclusion of null neighborhood effects on adult economic self-sufficiency, arguing that the intention-to-treat design and treatment definition mask real effects; using duration and quality of neighborhood exposure, the authors find evidence that sustained exposure to lower-poverty neighborhoods does improve economic outcomes.","dimensions":["identification","methods","claims","statistics","reproducibility"],"aiRelated":false,"field":"Sociology (neighborhood effects / urban poverty)","verifyNote":"DOI resolves in Crossref to this exact title in American Journal of Sociology (2008), indexed as journal-article.","doi_url":"https://doi.org/10.1086/588740","target_doi_url":null},{"id":"breznau-missing-main-effect","title":"The Missing Main Effect of Welfare State Regimes: A Replication of 'Social Policy Responsiveness in Developed Democracies' by Brooks and Manza","authors":["Nate Breznau"],"venue":"Sociological Science","tier":"A","year":"2015","doi":"10.15195/v2.a20","critiqueType":"replication","target":{"title":"Social Policy Responsiveness in Developed Democracies","authors":[],"doi":"10.1177/000312240607100306","year":""},"whatItChallenges":"Replicates Brooks and Manza's (2006, ASR) claim that public opinion drives welfare-state spending and finds it rests on a model specification error: they included an opinion-by-welfare-regime interaction while omitting the main effect of welfare regime; restoring the missing main effect across more than 800 model configurations eliminates the original finding in roughly 99.5% of 
cases.","dimensions":["statistics","methods","reproducibility","data_code","claims"],"aiRelated":false,"field":"Sociology (political sociology / welfare state)","verifyNote":"DOI resolves in Crossref to this exact title in Sociological Science (2015), indexed as journal-article.","doi_url":"https://doi.org/10.15195/v2.a20","target_doi_url":"https://doi.org/10.1177/000312240607100306"},{"id":"caughey-elections-regression-discontinuity","title":"Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008","authors":["Devin Caughey","Jasjeet S. Sekhon"],"venue":"Political Analysis","tier":"A","year":"2011","doi":"10.1093/pan/mpr032","critiqueType":"reanalysis","target":{"title":"Randomized Experiments from Non-random Selection in U.S. House Elections","authors":[],"doi":"10.1016/j.jeconom.2007.05.004","year":""},"whatItChallenges":"Replicating close U.S. House races, it shows bare winners and bare losers differ markedly on pretreatment covariates (financial, experience, and incumbency advantages), undermining the as-if-random assumption underpinning Lee-style regression discontinuity designs for elections. It attributes the imbalance to sorting via activities on or before Election Day rather than post-election manipulation.","dimensions":["identification","methods","statistics","reproducibility","generalisation"],"aiRelated":false,"field":"Political science (political methodology / causal inference)","verifyNote":"DOI resolves in Crossref to this exact title in Political Analysis (2011), indexed as journal-article.","doi_url":"https://doi.org/10.1093/pan/mpr032","target_doi_url":"https://doi.org/10.1016/j.jeconom.2007.05.004"},{"id":"eggers-validity-regression-discontinuity","title":"On the Validity of the Regression Discontinuity Design for Estimating Electoral Effects: New Evidence from Over 40,000 Close Races","authors":["Andrew C. Eggers","Anthony Fowler","Jens Hainmueller","Andrew B. Hall","James M. 
Snyder Jr."],"venue":"American Journal of Political Science","tier":"S","year":"2014","doi":"10.1111/ajps.12127","critiqueType":"replication","target":{"title":"Elections and the Regression Discontinuity Design: Lessons from Close U.S. House Races, 1942-2008","authors":[],"doi":"10.1093/pan/mpr032","year":""},"whatItChallenges":"Assembling over 40,000 close races across many electoral settings, it finds no systematic evidence of strategic sorting or covariate imbalance at the threshold, arguing the close-election RD design is generally valid and that the Caughey-Sekhon/Snyder imbalance is largely specific to postwar U.S. House races rather than a general flaw. It reframes the earlier critique as an unusual case rather than evidence against RD broadly.","dimensions":["identification","methods","statistics","reproducibility","generalisation","claims"],"aiRelated":false,"field":"Political science (political methodology / causal inference)","verifyNote":"DOI resolves in Crossref to this exact title in American Journal of Political Science (2014), indexed as journal-article.","doi_url":"https://doi.org/10.1111/ajps.12127","target_doi_url":"https://doi.org/10.1093/pan/mpr032"},{"id":"abramson-what-do-we","title":"What Do We Learn about Voter Preferences from Conjoint Experiments?","authors":["Scott F. 
Abramson","Korhan Kocak","Asya Magazinnik"],"venue":"American Journal of Political Science","tier":"S","year":"2022","doi":"10.1111/ajps.12714","critiqueType":"critical_commentary","target":{"title":"Causal Inference in Conjoint Analysis: Understanding Multidimensional Choices via Stated Preference Experiments","authors":[],"doi":"10.1093/pan/mpt024","year":""},"whatItChallenges":"It shows that the average marginal component effect (AMCE), the central estimand of conjoint experiments popularized by Hainmueller, Hopkins, and Yamamoto, is not well defined in terms of majority preferences: even with rational subjects a positive AMCE can point opposite to the true majority preference, so AMCEs do not license common claims about what voters prefer. It argues the estimand conflates direction and intensity of preferences across respondents.","dimensions":["theory","methods","claims","overclaiming","statistics"],"aiRelated":false,"field":"Political science (political methodology / survey experiments)","verifyNote":"DOI resolves in Crossref to this exact title in American Journal of Political Science (2022), indexed as journal-article.","doi_url":"https://doi.org/10.1111/ajps.12714","target_doi_url":"https://doi.org/10.1093/pan/mpt024"},{"id":"fowler-do-shark-attacks","title":"Do Shark Attacks Influence Presidential Elections? Reassessing a Prominent Finding on Voter Competence","authors":["Anthony Fowler","Andrew B. 
Hall"],"venue":"The Journal of Politics","tier":"S","year":"2018","doi":"10.1086/699244","critiqueType":"reanalysis","target":{"title":"Blind Retrospection: Electoral Responses to Drought, Flu, and Shark Attacks (in Democracy for Realists)","authors":[],"doi":null,"year":""},"whatItChallenges":"Reanalyzing Achen and Bartels's claim that 1916 New Jersey shark attacks cost Woodrow Wilson roughly ten points in beach communities, it finds the county-level effect shrinks and weakens under alternative specifications and the town-level Ocean County result largely vanishes once coding errors are corrected. It concludes there is little compelling evidence that shark attacks influenced the election, casting doubt on this prominent 'blind retrospection' demonstration of voter incompetence.","dimensions":["statistics","data_code","claims","overclaiming","reproducibility"],"aiRelated":false,"field":"Political science (voting behavior / retrospective voting)","verifyNote":"DOI resolves in Crossref to this exact title in The Journal of Politics (2018), indexed as journal-article.","doi_url":"https://doi.org/10.1086/699244","target_doi_url":null},{"id":"prior-challenge-measuring-media","title":"The Challenge of Measuring Media Exposure: Reply to Dilliplane, Goldman, and Mutz","authors":["Markus Prior"],"venue":"Political Communication","tier":"A","year":"2013","doi":"10.1080/10584609.2013.819539","critiqueType":"reply","target":{"title":"Televised Exposure to Politics: New Measures for a Fragmented Media Environment","authors":[],"doi":"10.1111/j.1540-5907.2012.00600.x","year":""},"whatItChallenges":"Prior critiques the program-list measure of televised political exposure that Dilliplane, Goldman, and Mutz proposed (and which the ANES adopted), arguing it has low construct validity because it never measures the amount of exposure and shows poor convergent validity by several criteria. 
He contends the measure conflates recall/recognition with exposure and overstates the predictive payoff of the new instrument.","dimensions":["methods","statistics","claims","generalisation"],"aiRelated":false,"field":"Political communication / media exposure measurement","verifyNote":"DOI resolves in Crossref to this exact title in Political Communication (2013), indexed as journal-article.","doi_url":"https://doi.org/10.1080/10584609.2013.819539","target_doi_url":"https://doi.org/10.1111/j.1540-5907.2012.00600.x"},{"id":"burton-reconsidering-evidence-moral","title":"Reconsidering evidence of moral contagion in online social networks","authors":["Jason W. Burton","Nicole Cruz","Ulrike Hahn"],"venue":"Nature Human Behaviour","tier":"S","year":"2021","doi":"10.1038/s41562-021-01133-5","critiqueType":"replication","target":{"title":"Emotion shapes the diffusion of moralized content in social networks","authors":[],"doi":"10.1073/pnas.1618923114","year":""},"whatItChallenges":"Re-tests Brady et al.'s (2017) 'moral contagion' method on six new Twitter corpora rather than reanalysing their data, and finds via out-of-sample prediction, model comparison and specification-curve analysis that the moral-contagion model performs no better than an implausibly-named 'XYZ contagion' placebo — challenging the strength of the original correlational claim while conceding moral contagion may still exist.","dimensions":["methods","statistics","identification","reproducibility","overclaiming","claims"],"aiRelated":true,"field":"Computational social science / social-media text analysis","verifyNote":"DOI resolves in Crossref to this exact title in Nature Human Behaviour (2021), indexed as journal-article.","doi_url":"https://doi.org/10.1038/s41562-021-01133-5","target_doi_url":"https://doi.org/10.1073/pnas.1618923114"},{"id":"pena-game-perspective-taking","title":"Game perspective-taking effects on willingness to help immigrants: A replication study with a Spanish 
sample","authors":["Jorge Peña","Juan Francisco Hernández Pérez"],"venue":"New Media & Society","tier":"A","year":"2019","doi":"10.1177/1461444819874472","critiqueType":"replication","target":{"title":"Game Perspective-Taking Effects on Players' Behavioral Intention, Attitudes, Subjective Norms, and Self-Efficacy to Help Immigrants: The Case of 'Papers, Please'","authors":[],"doi":"10.1089/cyber.2018.0030","year":""},"whatItChallenges":"A replication of a perspective-taking game study on willingness to help immigrants. The original reported reductions in behavioural intention, subjective norms and self-efficacy (attitudes were unaffected); the Spanish-sample replication reproduced the intention effect but not the subjective-norms or self-efficacy effects, while finding an attitude effect the original did not — partly corroborating and partly diverging from the original.","dimensions":["reproducibility","generalisation","claims","methods"],"aiRelated":false,"field":"New media / games studies / media effects","verifyNote":"DOI resolves in Crossref to this exact title in New Media & Society (2019), indexed as journal-article.","doi_url":"https://doi.org/10.1177/1461444819874472","target_doi_url":"https://doi.org/10.1089/cyber.2018.0030"},{"id":"crede-what-shall-we","title":"What Shall We Do About Grit? 
A Critical Review of What We Know and What We Don't Know","authors":["Marcus Credé"],"venue":"Educational Researcher","tier":"A","year":"2018","doi":"10.3102/0013189X18801322","critiqueType":"critical_commentary","target":{"title":"Grit: Perseverance and Passion for Long-Term Goals (Duckworth, Peterson, Matthews, & Kelly)","authors":[],"doi":"10.1037/0022-3514.92.6.1087","year":""},"whatItChallenges":"Critically reviews the grit literature popularized by Angela Duckworth, arguing the empirical evidence does not justify combining passion and perseverance into a single construct, that grit predicts academic performance only weakly (and no better than conscientiousness, a jangle-fallacy concern), and that there is no evidence grit interventions work.","dimensions":["statistics","claims","overclaiming","theory","novelty"],"aiRelated":false,"field":"Education / Educational Psychology","verifyNote":"DOI resolves in Crossref to this exact title in Educational Researcher (2018), indexed as journal-article.","doi_url":"https://doi.org/10.3102/0013189X18801322","target_doi_url":"https://doi.org/10.1037/0022-3514.92.6.1087"},{"id":"skiba-risks-consequences-oversimplifying","title":"Risks and Consequences of Oversimplifying Educational Inequities: A Response to Morgan et al. (2015)","authors":["Russell J. Skiba","Alfredo J. Artiles","Elizabeth B. Kozleski","Daniel J. Losen","Elizabeth G. 
Harry"],"venue":"Educational Researcher","tier":"A","year":"2016","doi":"10.3102/0013189X16644606","critiqueType":"comment","target":{"title":"Minorities Are Disproportionately Underrepresented in Special Education: Longitudinal Evidence Across Five Disability Conditions","authors":[],"doi":"10.3102/0013189X15591157","year":""},"whatItChallenges":"Directly challenges Morgan et al.'s widely cited claim that racial/ethnic minorities are underrepresented (not overrepresented) in special education, arguing the conclusion is mistaken because of sampling and model-specification choices, heavy covariate adjustment that conditions away the inequities of interest, and a failure to engage the broader complexity of disproportionality.","dimensions":["methods","identification","statistics","claims","overclaiming","generalisation"],"aiRelated":false,"field":"Education / Special Education Policy","verifyNote":"DOI resolves in Crossref to this exact title in Educational Researcher (2016), indexed as journal-article.","doi_url":"https://doi.org/10.3102/0013189X16644606","target_doi_url":"https://doi.org/10.3102/0013189X15591157"},{"id":"benbow-rejoinder-critiques-national","title":"Rejoinder to the Critiques of the National Mathematics Advisory Panel Final Report","authors":["Camilla Persson Benbow","Larry R. Faulkner"],"venue":"Educational Researcher","tier":"A","year":"2008","doi":"10.3102/0013189X08329195","critiqueType":"rejoinder","target":{"title":"Critiques of the National Mathematics Advisory Panel Final Report (incl. 
Boaler, 'When Politics Took the Place of Inquiry,' and Kelly, 'Reflections on the National Mathematics Advisory Panel Final Report')","authors":[],"doi":null,"year":""},"whatItChallenges":"The Panel chair and co-chair rebut a cluster of published critiques (notably Boaler and Kelly) that attacked the Panel's restriction to randomized/quasi-experimental evidence on math curricula, instruction, and learning, defending the evidentiary standards and contesting claims that the report's methodology was inappropriate for educational field research.","dimensions":["methods","identification","claims","generalisation","theory"],"aiRelated":false,"field":"Education / Mathematics Education Policy","verifyNote":"DOI resolves in Crossref to this exact title in Educational Researcher (2008), indexed as journal-article.","doi_url":"https://doi.org/10.3102/0013189X08329195","target_doi_url":null},{"id":"rothstein-measuring-impacts-teachers","title":"Measuring the Impacts of Teachers: Comment","authors":["Jesse Rothstein"],"venue":"American Economic Review","tier":"S","year":"2017","doi":"10.1257/aer.20141440","critiqueType":"comment","target":{"title":"Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates","authors":["Raj Chetty","John N. Friedman","Jonah E. Rockoff"],"doi":"10.1257/aer.104.9.2593","year":""},"whatItChallenges":"Challenges Chetty, Friedman, and Rockoff's (2014) claim that the teacher-switching quasi-experiment shows student sorting creates negligible bias in teacher value-added (VA) scores. 
Rothstein shows teacher switching is correlated with changes in prior student preparedness, so the design is invalid; correcting for this reveals moderate VA bias (10-35% of teachers' causal effect variance) and shows long-run results are fragile to control choices.","dimensions":["identification","methods","statistics","reproducibility","claims"],"aiRelated":false,"field":"Economics (education / labor economics)","verifyNote":"DOI resolves in Crossref to this exact title in American Economic Review (2017), indexed as journal-article.","doi_url":"https://doi.org/10.1257/aer.20141440","target_doi_url":"https://doi.org/10.1257/aer.104.9.2593"},{"id":"neumark-minimum-wages-employment","title":"Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Comment","authors":["David Neumark","William Wascher"],"venue":"American Economic Review","tier":"S","year":"2000","doi":"10.1257/aer.90.5.1362","critiqueType":"comment","target":{"title":"Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania","authors":["David Card","Alan B. Krueger"],"doi":null,"year":""},"whatItChallenges":"Re-examines Card and Krueger's (1994) finding that a New Jersey minimum-wage increase did not reduce (and may have raised) fast-food employment. 
Using administrative payroll records rather than the original telephone-survey data, Neumark and Wascher find that employment fell after the minimum-wage rise, contradicting the original positive/zero estimate and attributing the discrepancy to measurement error in the survey data.","dimensions":["data_code","identification","methods","claims","reproducibility"],"aiRelated":false,"field":"Economics (labor economics)","verifyNote":"DOI resolves in Crossref to this exact title in American Economic Review (2000), indexed as journal-article.","doi_url":"https://doi.org/10.1257/aer.90.5.1362","target_doi_url":null},{"id":"andrikogiannopoulou-reassessing-false-discoveries","title":"Reassessing False Discoveries in Mutual Fund Performance: Skill, Luck, or Lack of Power?","authors":["Angie Andrikogiannopoulou","Filippos Papakonstantinou"],"venue":"The Journal of Finance","tier":"S","year":"2019","doi":"10.1111/jofi.12784","critiqueType":"reanalysis","target":{"title":"False Discoveries in Mutual Fund Performance: Measuring Luck in Estimated Alphas","authors":["Laurent Barras","Olivier Scaillet","Russ Wermers"],"doi":"10.1111/j.1540-6261.2009.01527.x","year":""},"whatItChallenges":"Reanalyzes the false-discovery-rate (FDR) method that Barras, Scaillet, and Wermers (2010) use to separate skilled, zero-alpha, and unskilled mutual funds. 
Andrikogiannopoulou and Papakonstantinou show via simulation that the FDR estimator is severely biased and underpowered at empirically relevant sample sizes, drastically overstating the fraction of zero-alpha funds and understating the proportion of skilled and unskilled funds.","dimensions":["statistics","methods","reproducibility","claims"],"aiRelated":false,"field":"Finance (asset pricing / mutual fund performance)","verifyNote":"DOI resolves in Crossref to this exact title in The Journal of Finance (2019), indexed as journal-article.","doi_url":"https://doi.org/10.1111/jofi.12784","target_doi_url":"https://doi.org/10.1111/j.1540-6261.2009.01527.x"},{"id":"messner-poverty-infant-mortality","title":"Poverty, Infant Mortality, and Homicide Rates in Cross-National Perspective: Assessments of Criterion and Construct Validity","authors":["Steven F. Messner","Lawrence E. Raffalovich","Gretchen M. Sutton"],"venue":"Criminology","tier":"S","year":"2010","doi":"10.1111/j.1745-9125.2010.00194.x","critiqueType":"critical_commentary","target":{"title":"A Methodological Addition to the Cross-National Empirical Literature on Social Structure and Homicide: A First Test of the Poverty-Homicide Thesis","authors":["William Alex Pridemore"],"doi":"10.1111/j.1745-9125.2008.00106.x","year":""},"whatItChallenges":"Responds to Pridemore's critique of using infant mortality as a proxy for poverty in cross-national homicide research. 
Rather than re-running his analysis, the authors assemble a new 16-nation panel (1993-2000) with direct income-based poverty measures and find infant mortality correlates more strongly with relative than absolute poverty, arguing disadvantage is best treated as a multidimensional construct — a qualified, collaborative response rather than a wholesale rejection of the proxy.","dimensions":["statistics","methods","claims"],"aiRelated":false,"field":"Criminology","verifyNote":"DOI resolves in Crossref to this exact title in Criminology (2010), indexed as journal-article.","doi_url":"https://doi.org/10.1111/j.1745-9125.2010.00194.x","target_doi_url":"https://doi.org/10.1111/j.1745-9125.2008.00106.x"},{"id":"sievert-replication-representative-bureaucracy","title":"A replication of \"Representative bureaucracy and the willingness to coproduce\"","authors":["Martin Sievert"],"venue":"Public Administration","tier":"A","year":"2021","doi":"10.1111/padm.12743","critiqueType":"replication","target":{"title":"Representative Bureaucracy and the Willingness to Coproduce: An Experimental Study","authors":["Norma M. Riccucci","Gregg G. 
Van Ryzin","Huafang Li"],"doi":"10.1111/puar.12401","year":""},"whatItChallenges":"A wide replication, on new data, of Riccucci, Van Ryzin and Li's (2016) survey experiment on representative bureaucracy and citizens' willingness to coproduce, testing whether the original's representation effects hold in a different national context rather than re-running the original data.","dimensions":["reproducibility","statistics","methods","generalisation","claims"],"aiRelated":false,"field":"Public administration","verifyNote":"DOI resolves in Crossref to this exact title in Public Administration (2021), indexed as journal-article.","doi_url":"https://doi.org/10.1111/padm.12743","target_doi_url":"https://doi.org/10.1111/puar.12401"},{"id":"kleck-impossible-policy-evaluations","title":"Impossible Policy Evaluations and Impossible Conclusions: A Comment on Koper and Roth","authors":["Gary Kleck"],"venue":"Journal of Quantitative Criminology","tier":"A","year":"2001","doi":"10.1023/a:1007574415289","critiqueType":"comment","target":{"title":"The Impact of the 1994 Federal Assault Weapon Ban on Gun Violence Outcomes: An Assessment of Multiple Outcome Measures and Some Lessons for Policy Evaluation","authors":["Christopher S. Koper","Jeffrey A. Roth"],"doi":"10.1023/a:1007522431219","year":""},"whatItChallenges":"Argues that Koper and Roth's evaluation of the 1994 federal assault weapons ban could not, by design, detect any plausible effect because assault weapons figure in so few homicides, so the data lack statistical power to support any conclusion. 
Contends their tentative inference that the ban may have reduced gun homicides is essentially impossible to sustain from the evidence presented.","dimensions":["statistics","identification","methods","claims","overclaiming"],"aiRelated":false,"field":"Quantitative criminology / public policy","verifyNote":"DOI resolves in Crossref to this exact title in Journal of Quantitative Criminology (2001), indexed as journal-article.","doi_url":"https://doi.org/10.1023/a:1007574415289","target_doi_url":"https://doi.org/10.1023/a:1007522431219"},{"id":"greenberg-long-term-trends","title":"Long-Term Trends in Crimes of Violence (Comment on Cooney, 2003)","authors":["David F. Greenberg"],"venue":"Criminology","tier":"S","year":"2003","doi":"10.1111/j.1745-9125.2003.tb01024.x","critiqueType":"comment","target":{"title":"The Privatization of Violence","authors":["Mark Cooney"],"doi":"10.1111/j.1745-9125.2003.tb01023.x","year":""},"whatItChallenges":"Challenges Cooney's (2003) historical thesis that violence has been 'privatized' (shifting from elite/collective to marginal/individual actors), arguing the long-term empirical evidence on violent-crime trends and the social characteristics of offenders does not support the claimed qualitative transformation. 
Questions the selectivity and interpretation of the historical and anthropological evidence Cooney marshals.","dimensions":["theory","claims","methods","generalisation","data_code"],"aiRelated":false,"field":"Criminology","verifyNote":"DOI resolves in Crossref to this exact title in Criminology (2003), indexed as journal-article.","doi_url":"https://doi.org/10.1111/j.1745-9125.2003.tb01024.x","target_doi_url":"https://doi.org/10.1111/j.1745-9125.2003.tb01023.x"},{"id":"obermeyer-health-algo-bias","title":"Dissecting racial bias in an algorithm used to manage the health of populations","authors":["Ziad Obermeyer","Brian Powers","Christine Vogeli","Sendhil Mullainathan"],"venue":"Science","tier":"S","year":"2019","doi":"10.1126/science.aax2342","critiqueType":"reanalysis","target":{"title":"A widely deployed commercial population-health risk-prediction algorithm that uses health-care cost as a proxy for health need","authors":[],"doi":null,"year":""},"whatItChallenges":"The piece challenges a widely deployed commercial population-health algorithm that uses health-care cost as a proxy for health need, showing it exhibits significant racial bias: at any given risk score Black patients are sicker (more uncontrolled illness) than White patients, because unequal access means less money is spent on Black patients despite equal need. It concludes that remedying this would raise the share of Black patients flagged for extra help from 17.7% to 46.5%, and generalizes that choosing convenient proxies for ground truth (here, cost for illness) is an important and underappreciated source of algorithmic bias.","dimensions":["claims","data_code","generalisation","statistics"],"aiRelated":true,"field":"Machine learning fairness / health policy (algorithmic bias in healthcare)","verifyNote":"DOI resolves in Crossref to \"Dissecting racial bias in an algorithm used to manage the health of populations\" in Science (2019), indexed as journal-article. 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).","doi_url":"https://doi.org/10.1126/science.aax2342","target_doi_url":null},{"id":"liang-gpt-detectors-biased","title":"GPT detectors are biased against non-native English writers","authors":["Weixin Liang","Mert Yuksekgonul","Yining Mao","Eric Wu","James Zou"],"venue":"Patterns","tier":"A","year":"2023","doi":"10.1016/j.patter.2023.100779","critiqueType":"critical_commentary","target":{"title":"Commercial and academic GPT/AI-text detectors claiming reliable detection of machine-generated text","authors":[],"doi":null,"year":""},"whatItChallenges":"Challenges the reliability and fairness of GPT/AI-text detectors by showing they frequently misclassify non-native English writing as AI-generated; concludes this bias threatens to marginalize non-native English speakers in evaluative and educational settings and must be addressed for an equitable digital landscape.","dimensions":["claims","generalisation","overclaiming","methods"],"aiRelated":true,"field":"Natural language processing / AI ethics","verifyNote":"DOI resolves in Crossref to \"GPT detectors are biased against non-native English writers\" in Patterns (2023), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22). Note: abstract is brief; summary kept to what the abstract licenses.","doi_url":"https://doi.org/10.1016/j.patter.2023.100779","target_doi_url":null},{"id":"eady-russian-ira-impact","title":"Exposure to the Russian Internet Research Agency foreign influence campaign on Twitter in the 2016 US election and its relationship to attitudes and voting behavior","authors":["Gregory Eady","Tom Paskhalis","Jan Zilinsky","Richard Bonneau","Jonathan Nagler","Joshua A. 
Tucker"],"venue":"Nature Communications","tier":"A","year":"2023","doi":"10.1038/s41467-022-35576-9","critiqueType":"replication","target":{"title":"The prevailing claim that the Russian Internet Research Agency's 2016 Twitter campaign had large effects on US attitudes and voting behaviour","authors":[],"doi":null,"year":""},"whatItChallenges":"Challenges the prevailing concern that the Russian Internet Research Agency's 2016 Twitter campaign meaningfully shaped US attitudes and votes. Using longitudinal survey data linked to respondents' Twitter feeds, it finds exposure was extremely concentrated (1% of users saw 70% of exposures), concentrated among strong Republicans, dwarfed by domestic news and politicians, and shows no evidence of a meaningful relationship to changes in attitudes, polarization, or voting behavior.","dimensions":["claims","identification","overclaiming","data_code","generalisation"],"aiRelated":true,"field":"political science","verifyNote":"DOI resolves in Crossref to \"Exposure to the Russian Internet Research Agency foreign influence campaign on Twitter in the 2016 US election and its relationship to attitudes and voting behavior\" in Nature Communications (2023), indexed as journal-article. 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).","doi_url":"https://doi.org/10.1038/s41467-022-35576-9","target_doi_url":null},{"id":"martinez-gpt4-bar-exam","title":"Re-evaluating GPT-4’s bar exam performance","authors":["Eric Martínez"],"venue":"Artificial Intelligence and Law","tier":"A","year":"2024","doi":"10.1007/s10506-024-09396-9","critiqueType":"reanalysis","target":{"title":"GPT-4 passes the bar exam","authors":["Daniel Martin Katz","Michael James Bommarito","Shang Gao","Pablo Arredondo"],"doi":"10.1098/rsta.2023.0254","year":"2024"},"whatItChallenges":"It challenges OpenAI's headline claim that GPT-4 scored at the 90th percentile on the Uniform Bar Exam, arguing the figure is overinflated because it relies on a skewed February repeat-taker comparison group; the paper estimates GPT-4's percentile drops to roughly the 62nd percentile against first-time takers and roughly the 48th percentile (about 15th on essays) against those who actually passed, and it questions the validity of the reported essay/scaled (298) score while finding that few-shot chain-of-thought prompting, but not temperature, significantly affects MBE performance.","dimensions":["methods","statistics","claims","reproducibility","overclaiming","generalisation"],"aiRelated":true,"field":"AI and Law","verifyNote":"DOI resolves in Crossref to \"Re-evaluating GPT-4’s bar exam performance\" in Artificial Intelligence and Law (2024), indexed as journal-article. 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22).","doi_url":"https://doi.org/10.1007/s10506-024-09396-9","target_doi_url":"https://doi.org/10.1098/rsta.2023.0254"},{"id":"lum-isaac-predict-serve","title":"To Predict and Serve?","authors":["Kristian Lum","William Isaac"],"venue":"Significance","tier":"B","year":"2016","doi":"10.1111/j.1740-9713.2016.00960.x","critiqueType":"critical_commentary","target":{"title":"Place-based predictive-policing systems (PredPol-style) claiming objective, bias-free crime prediction","authors":[],"doi":null,"year":""},"whatItChallenges":"The piece challenges the premise that place-based predictive-policing systems deliver objective, bias-free crime prediction, arguing that because these systems are trained on biased data, their outputs and resulting deployment can reproduce that bias with adverse social consequences. The abstract frames this as an examination of the evidence and social costs rather than reporting a specific empirical result.","dimensions":["data_code","claims","overclaiming","methods"],"aiRelated":true,"field":"criminology / data science (algorithmic fairness in policing)","verifyNote":"DOI resolves in Crossref to \"To Predict and Serve?\" in Significance (2016), indexed as journal-article. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22). Note: abstract is brief; summary kept to what the abstract licenses.","doi_url":"https://doi.org/10.1111/j.1740-9713.2016.00960.x","target_doi_url":null},{"id":"a-comparison-of-deep-learning-performance-against-","title":"A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis","authors":["Xiaoxuan Liu","Livia Faes","Aditya U. Kale","Siegfried K. Wagner","Dun Jack Fu","Alice Bruynseels","et al. (Alastair K. 
Denniston)"],"venue":"The Lancet Digital Health","tier":"A","year":"2019","doi":"10.1016/s2589-7500(19)30123-2","critiqueType":"critical_commentary","target":{"title":"The body of deep learning diagnostic-imaging studies (2012–2019) that report deep learning algorithm performance in disease classification relative to health-care professionals, particularly those claiming equivalent or superior diagnostic accuracy.","authors":[],"doi":null,"year":""},"whatItChallenges":"Through a systematic review and meta-analysis of 82 studies, it finds that while deep learning diagnostic performance appears equivalent to health-care professionals, very few studies used external validation or compared algorithms and clinicians on the same sample, and poor reporting is pervasive — undermining confidence in the field's accuracy claims. It thus challenges the reliability and generalisability of the existing deep-learning-versus-clinician comparison literature rather than affirming it at face value.","dimensions":["methods","reproducibility","generalisation","overclaiming","claims","statistics"],"aiRelated":true,"field":"Medical artificial intelligence / diagnostic imaging (evidence synthesis)","verifyNote":"DOI resolves in Crossref to \"A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis\" in The Lancet Digital Health (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1016/s2589-7500(19)30123-2","target_doi_url":null},{"id":"ai-for-radiographic-covid-19-detection-selects-sho","title":"AI for radiographic COVID-19 detection selects shortcuts over signal","authors":["Alex J. DeGrave","Joseph D. 
Janizek","Su-In Lee"],"venue":"Nature Machine Intelligence","tier":"A","year":"2021","doi":"10.1038/s42256-021-00338-7","critiqueType":"reanalysis","target":{"title":"Recently reported deep-learning AI systems that claim to accurately detect COVID-19 from chest radiographs (and, by extension, related CT and other medical-imaging systems trained via the same data-collection approach).","authors":[],"doi":null,"year":""},"whatItChallenges":"Using explainable-AI techniques, the authors re-examine published deep-learning systems claiming accurate COVID-19 detection from chest radiographs and find they rely on confounding \"shortcuts\" rather than medical pathology, so they appear accurate but fail in new hospitals. They further argue that external-data evaluation is insufficient to detect this, since the spurious shortcuts may not degrade performance even in new hospitals.","dimensions":["methods","identification","generalisation","reproducibility","overclaiming","claims"],"aiRelated":true,"field":"Medical imaging / machine learning (AI in radiology)","verifyNote":"DOI resolves in Crossref to \"AI for radiographic COVID-19 detection selects shortcuts over signal\" in Nature Machine Intelligence (2021). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1038/s42256-021-00338-7","target_doi_url":null},{"id":"external-validation-of-a-widely-implemented-propri","title":"External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients","authors":["Andrew Wong","Erkin Otles","John P. 
Donnelly","Andrew Krumm","Jeffrey McCullough","Olivia DeTroyer-Cooley","Justin Pestrue","Marie Phillips"],"venue":"JAMA Internal Medicine","tier":"A","year":"2021","doi":"10.1001/jamainternmed.2021.2626","critiqueType":"replication","target":{"title":"The Epic Sepsis Model (ESM), a proprietary sepsis early-warning prediction algorithm implemented at hundreds of US hospitals.","authors":[],"doi":null,"year":""},"whatItChallenges":"This study externally validates the proprietary Epic Sepsis Model on 38,455 hospitalizations and finds it has poor discrimination (AUROC 0.63) and calibration, missing 67% of sepsis patients while generating alerts for 18% of all hospitalizations (high alert-fatigue burden). It argues the model's widespread adoption despite this poor performance raises fundamental concerns about national sepsis management.","dimensions":["statistics","claims","overclaiming","generalisation","methods"],"aiRelated":true,"field":"Clinical medicine / medical informatics (sepsis prediction)","verifyNote":"DOI resolves in Crossref to \"External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients\" in JAMA Internal Medicine (2021). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1001/jamainternmed.2021.2626","target_doi_url":null},{"id":"fair-prediction-with-disparate-impact-a-study-of-b","title":"Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments","authors":["Alexandra Chouldechova"],"venue":"Big Data","tier":"B","year":"2017","doi":"10.1089/big.2016.0047","critiqueType":"reanalysis","target":{"title":"Recidivism prediction instruments (RPIs) and the recently applied fairness criteria used to assess them; implicitly the dispute over whether such instruments (e.g., the kind at the center of the controversy referenced) exhibit discriminatory bias.","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges the assumption that an RPI can simultaneously satisfy all of the recently proposed fairness criteria, demonstrating that these criteria are mutually incompatible whenever recidivism prevalence differs across groups, and it shows that disparate impact can arise when an instrument fails to achieve error-rate balance.","dimensions":["statistics","claims","methods","theory"],"aiRelated":true,"field":"Machine learning fairness / criminal justice risk assessment","verifyNote":"DOI resolves in Crossref to \"Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments\" in Big Data (2017). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1089/big.2016.0047","target_doi_url":null},{"id":"the-limits-of-human-predictions-of-recidivism","title":"The limits of human predictions of recidivism","authors":["Zhiyuan Lin","Jongbin Jung","Sharad Goel","Jennifer Skeem"],"venue":"Science Advances","tier":"A","year":"2020","doi":"10.1126/sciadv.aaz0652","critiqueType":"replication","target":{"title":"Dressel and Farid's experiment finding that laypeople were as accurate as statistical algorithms in predicting recidivism (published as \"The accuracy, fairness, and limits of predicting recidivism\").","authors":[],"doi":null,"year":""},"whatItChallenges":"The abstract reports a replication and extension of Dressel and Farid's study: under similar conditions it reproduces their finding that humans and algorithms perform comparably, but in three other datasets and in conditions without immediate feedback or with an enriched set of risk factors, algorithms outperformed humans. It concludes that algorithms can beat human recidivism predictions in ecologically valid settings, challenging the original's broader implication that risk tools add little value.","dimensions":["methods","generalisation","claims","overclaiming"],"aiRelated":true,"field":"Computational social science / criminal justice risk assessment","verifyNote":"DOI resolves in Crossref to \"The limits of human predictions of recidivism\" in Science Advances (2020). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1126/sciadv.aaz0652","target_doi_url":null},{"id":"evaluating-replicability-of-laboratory-experiments","title":"Evaluating replicability of laboratory experiments in economics","authors":["Colin F. 
Camerer","Anna Dreber","Eskil Forsell","et al."],"venue":"Science","tier":"S","year":"2016","doi":"10.1126/science.aaf0918","critiqueType":"replication","target":{"title":"18 laboratory experiments in economics published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014","authors":[],"doi":null,"year":""},"whatItChallenges":"The authors directly replicated 18 economics laboratory studies from AER and QJE (2011-2014) using pre-registered analysis plans with at least 90% statistical power, finding a significant same-direction effect in only 11 of 18 (61%) replications, with replicated effect sizes averaging 66% of the originals. This empirically tests and partially undercuts the reliability of the original published findings.","dimensions":["reproducibility","statistics","methods","claims"],"aiRelated":false,"field":"Experimental economics","verifyNote":"DOI resolves in Crossref to \"Evaluating replicability of laboratory experiments in economics\" in Science (2016). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1126/science.aaf0918","target_doi_url":null},{"id":"common-pitfalls-and-recommendations-for-using-mach","title":"Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans","authors":["Michael Roberts","Derek Driggs","Matthew Thorpe","et al. 
(AIX-COVNET)"],"venue":"Nature Machine Intelligence","tier":"A","year":"2021","doi":"10.1038/s42256-021-00307-0","critiqueType":"critical_commentary","target":{"title":"The body of 2020 machine-learning models published as papers/preprints (62 studies, screened from 2,212) claiming to diagnose or prognosticate COVID-19 from chest X-ray (CXR) and CT images.","authors":[],"doi":null,"year":""},"whatItChallenges":"Through a systematic review of all CXR/CT machine-learning COVID-19 models published between 1 Jan and 3 Oct 2020, it finds that none of the 62 included models is of potential clinical use due to methodological flaws and/or underlying biases, and it issues recommendations to remedy these problems.","dimensions":["methods","data_code","reproducibility","overclaiming","generalisation","claims"],"aiRelated":true,"field":"Medical machine learning / radiology AI (COVID-19 imaging)","verifyNote":"DOI resolves in Crossref to \"Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans\" in Nature Machine Intelligence (2021). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1038/s42256-021-00307-0","target_doi_url":null},{"id":"deep-learning-predicts-hip-fracture-using-confound","title":"Deep learning predicts hip fracture using confounding patient and healthcare variables","authors":["Marcus A. Badgeley","John R. 
Zech","Luke Oakden-Rayner","et al."],"venue":"npj Digital Medicine","tier":"A","year":"2019","doi":"10.1038/s41746-019-0105-1","critiqueType":"reanalysis","target":{"title":"Computer-aided diagnosis / deep-learning models that claim to detect hip fractures from pelvic radiographs, by re-examining what image features such models actually leverage.","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges the validity of deep-learning CAD models for hip-fracture detection by showing that the model also predicts patient traits and 14 hospital process variables (e.g., scanner model AUC=1.00, \"priority\" order AUC=0.79) from the same radiographs, and that fracture-prediction performance collapses to random (AUC=0.52) once fracture risk is balanced across these patient and process variables — indicating the model's apparent accuracy is largely driven by confounding shortcuts rather than genuine fracture features.","dimensions":["identification","methods","claims","overclaiming","generalisation"],"aiRelated":true,"field":"Medical imaging / clinical machine learning (radiology AI)","verifyNote":"DOI resolves in Crossref to \"Deep learning predicts hip fracture using confounding patient and healthcare variables\" in npj Digital Medicine (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1038/s41746-019-0105-1","target_doi_url":null},{"id":"counting-chickens-when-they-hatch-timing-and-the-e","title":"Counting Chickens when they Hatch: Timing and the Effects of Aid on Growth","authors":["Michael A. Clemens","Steven Radelet","Rikhil R. 
Bhavnani","Samuel Bazzi"],"venue":"The Economic Journal","tier":"A","year":"2012","doi":"10.1111/j.1468-0297.2011.02482.x","critiqueType":"reanalysis","target":{"title":"The three most influential published cross-country aid-growth studies (referenced collectively; the proposed target names the Aid, Policies, and Growth / Burnside-Dollar-style aid-growth literature), whose regression designs the authors re-estimate.","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges the divergent cross-country aid-growth estimates of three influential prior studies by re-running their exact regression specifications while adding realistic lag assumptions about aid's timing and dropping invalid or weak instruments. With these changes all three designs converge on the finding that increases in aid are followed by modest increases in investment and growth, implying aid causes some modest growth that varies across recipients and diminishes at high aid levels.","dimensions":["identification","methods","statistics","claims"],"aiRelated":false,"field":"Development economics / empirical macroeconomics","verifyNote":"DOI resolves in Crossref to \"Counting Chickens when they Hatch: Timing and the Effects of Aid on Growth\" in The Economic Journal (2012). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1111/j.1468-0297.2011.02482.x","target_doi_url":null},{"id":"the-echo-chamber-is-overstated-the-moderating-effe","title":"The echo chamber is overstated: the moderating effect of political interest and diverse media","authors":["Elizabeth Dubois","Grant Blank"],"venue":"Information, Communication & Society","tier":"B","year":"2018","doi":"10.1080/1369118x.2018.1428656","critiqueType":"critical_commentary","target":{"title":"The echo-chamber / filter-bubble thesis that, in a high-choice media environment, individuals (especially the politically interested) select self-reinforcing content and become segregated into homogeneous, partisan information environments — including prior single-medium studies that operationalize \"being in an echo chamber\" through narrow definitions and measurements.","authors":[],"doi":null,"year":""},"whatItChallenges":"Using a nationally representative survey of UK adult internet users (N=2000) and five echo-chamber measures, the paper challenges the prevailing echo-chamber thesis by showing that politically interested people and those with diverse media diets tend to avoid echo chambers, so only a small population segment is actually caught in one. It further argues that single-medium studies and those using narrow definitions/measurements are flawed because they fail to test the theory in a realistic multi-media environment.","dimensions":["claims","methods","overclaiming","theory","generalisation"],"aiRelated":true,"field":"Communication / political communication / media studies","verifyNote":"DOI resolves in Crossref to \"The echo chamber is overstated: the moderating effect of political interest and diverse media\" in Information, Communication & Society (2018). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed (genuine post-publication critique).","doi_url":"https://doi.org/10.1080/1369118x.2018.1428656","target_doi_url":null},{"id":"the-parable-of-google-flu-traps-in-big-data-analys","title":"The Parable of Google Flu: Traps in Big Data Analysis","authors":["David Lazer","Ryan Kennedy","Gary King","Alessandro Vespignani"],"venue":"Science","tier":"S","year":"2014","doi":"10.1126/science.1248506","critiqueType":"critical_commentary","target":{"title":"Google Flu Trends (GFT), the search-query-based influenza-tracking system built to predict CDC influenza-like-illness estimates, widely cited as an exemplary use of big data.","authors":[],"doi":null,"year":""},"whatItChallenges":"The piece challenges the celebrated big-data system Google Flu Trends, noting that despite being built to predict CDC influenza-like-illness reports, in February 2013 it predicted more than double the proportion of doctor visits that the CDC reported. It argues these large, largely avoidable prediction errors hold broader lessons about the pitfalls of big data analysis.","dimensions":["methods","claims","overclaiming","reproducibility"],"aiRelated":true,"field":"Data science / computational epidemiology","verifyNote":"DOI resolves in Crossref to \"The Parable of Google Flu: Traps in Big Data Analysis\" in Science (2014). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1126/science.1248506","target_doi_url":null},{"id":"variable-generalization-performance-of-a-deep-lear","title":"Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study","authors":["John R. Zech","Marcus A. Badgeley","Manway Liu","Anthony B. Costa","Joseph J. 
Titano","Eric Karl Oermann"],"venue":"PLOS Medicine","tier":"A","year":"2018","doi":"10.1371/journal.pmed.1002683","critiqueType":"reanalysis","target":{"title":"A class/body of prior deep-learning work claiming high diagnostic accuracy for CNN-based pneumonia detection on chest radiographs, and the broader assumption that such image-classification CNNs generalize well to new data. The abstract references \"recent work\" and prior optimism but does not name a specific paper or system (e.g., it does not mention CheXNet by name).","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges the assumption that pneumonia-detection CNNs generalize across hospitals, showing models performed better internally than externally in 3 of 5 comparisons and that differing pneumonia prevalence between sites let a model reach AUC 0.861 merely by sorting hospital. It further shows CNNs can identify the source hospital from a radiograph with ~99.95-99.98% accuracy, implying reported accuracy may reflect site-specific confounding rather than true pathology detection.","dimensions":["generalisation","methods","identification","overclaiming","claims","statistics"],"aiRelated":true,"field":"Medical imaging / machine learning in radiology (computer-aided diagnosis)","verifyNote":"DOI resolves in Crossref to \"Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study\" in PLOS Medicine (2018). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1371/journal.pmed.1002683","target_doi_url":null},{"id":"stop-explaining-black-box-machine-learning-models-","title":"Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead","authors":["Cynthia Rudin"],"venue":"Nature Machine Intelligence","tier":"A","year":"2019","doi":"10.1038/s42256-019-0048-x","critiqueType":"critical_commentary","target":{"title":"The practice of using \"explainable AI\" methods to post-hoc explain black-box machine learning models deployed for high-stakes decisions, named at the domain level (criminal justice, healthcare, computer vision); the abstract does not name COMPAS specifically.","authors":[],"doi":null,"year":""},"whatItChallenges":"It argues that attempting to explain black-box models post hoc, rather than building inherently interpretable models, perpetuates bad practice and risks great societal harm in high-stakes settings. It contends that for applications directly affecting human lives (healthcare, criminal justice), effort should go toward inherently interpretable models, which it claims could often replace black boxes.","dimensions":["methods","claims","theory","overclaiming"],"aiRelated":true,"field":"Machine learning / AI ethics and interpretability","verifyNote":"DOI resolves in Crossref to \"Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead\" in Nature Machine Intelligence (2019). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1038/s42256-019-0048-x","target_doi_url":null},{"id":"many-labs-4-failure-to-replicate-mortality-salienc","title":"Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement","authors":["Richard A. 
Klein","Corey L. Cook","Charles R. Ebersole","et al."],"venue":"Collabra Psychology","tier":"B","year":"2019","doi":"10.1525/collabra.35271","critiqueType":"replication","target":{"title":"The classic mortality salience / worldview-defense finding from Terror Management Theory, specifically Greenberg et al. (1994).","authors":[],"doi":null,"year":""},"whatItChallenges":"A 17-lab, ~1,550-participant preregistered replication that failed to reproduce the classic Terror Management Theory mortality salience effect (Greenberg et al., 1994) under any condition, including with original-author involvement in study design. The authors conclude the original finding was either a false positive or that the conditions required to obtain it are not understood or no longer exist.","dimensions":["reproducibility","methods","statistics","claims"],"aiRelated":false,"field":"Social psychology","verifyNote":"DOI resolves in Crossref to \"Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement\" in Collabra Psychology (2019). 
whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1525/collabra.35271","target_doi_url":null},{"id":"replicating-anomalies","title":"Replicating Anomalies","authors":["Kewei Hou","Chen Xue","Lu Zhang"],"venue":"Review of Financial Studies","tier":"A","year":"2017","doi":"10.1093/rfs/hhy131","critiqueType":"replication","target":{"title":"The published cross-sectional stock-return anomaly literature — the body of 452 documented anomalies (including the trading-frictions category) and their originally reported return predictability.","authors":[],"doi":null,"year":""},"whatItChallenges":"Re-testing 452 published anomalies with microcaps controlled via NYSE breakpoints and value-weighted returns, the study finds 65% fail the single-test |t|>=1.96 hurdle (82% under a 2.78 multiple-testing hurdle), and that even surviving anomalies have much smaller economic magnitudes than originally reported. It concludes capital markets are more efficient than the prior literature recognized.","dimensions":["methods","statistics","reproducibility","overclaiming","claims"],"aiRelated":false,"field":"Empirical asset pricing / finance","verifyNote":"DOI resolves in Crossref to \"Replicating Anomalies\" in Review of Financial Studies (2017). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1093/rfs/hhy131","target_doi_url":null},{"id":"variability-may-limit-the-translation-of-neuroimag","title":"Variability may limit the translation of neuroimaging findings comment on “Variability in the analysis of a single neuroimaging dataset by many teams”","authors":["Rotem Botvinik-Nezer et al. (NARPS)"],"venue":"Journal of Affective Disorders","tier":"B","year":"2020","doi":"10.1016/j.jad.2020.09.048","critiqueType":"comment","target":{"title":"Botvinik-Nezer et al. 
(2020), \"Variability in the analysis of a single neuroimaging dataset by many teams\" (the NARPS study), and more broadly the hypothesis-testing paradigm in standard fMRI analysis.","authors":[],"doi":null,"year":""},"whatItChallenges":"It comments that the analytic variability documented when many teams analyzed the same fMRI dataset undermines the persuasiveness of fMRI findings, and argues this reproducibility problem stems from the hypothesis-testing paradigm. It proposes that a machine-learning-based predictive-modeling approach could mitigate the issue and better detect subtle spatial brain patterns and individual-level effects.","dimensions":["reproducibility","methods","statistics"],"aiRelated":false,"field":"Neuroimaging / psychiatry (fMRI methodology)","verifyNote":"DOI resolves in Crossref to \"Variability may limit the translation of neuroimaging findings comment on “Variability in the analysis of a single neuroimaging dataset by many teams”\" in Journal of Affective Disorders (2020). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre confirmed.","doi_url":"https://doi.org/10.1016/j.jad.2020.09.048","target_doi_url":null},{"id":"camerer-nhb-replicability-2018","title":"Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015","authors":["Colin F. Camerer","Anna Dreber","Felix Holzmeister","Teck-Hua Ho","Jürgen Huber","Magnus Johannesson","Michael Kirchler","Gideon Nave","Brian A. 
Nosek","Thomas Pfeiffer"],"venue":"Nature Human Behaviour","tier":"S","year":"2018","doi":"10.1038/s41562-018-0399-z","critiqueType":"replication","target":{"title":"21 systematically selected social-science experiments published in Nature and Science (2010–2015)","authors":[],"doi":null,"year":""},"whatItChallenges":"It re-tests 21 high-profile social-science experiments via pre-registered, high-powered replications (sample sizes ~5x the originals) and finds a significant same-direction effect for only 13 (62%), with replication effect sizes averaging about 50% of the originals, indicating that both false positives and inflated true-positive effect sizes contribute to imperfect reproducibility. It further reports that peer beliefs predicted which results would replicate, implying failures were not due to chance alone.","dimensions":["reproducibility","statistics","methods","claims"],"aiRelated":false,"field":"Social science (experimental); metascience","verifyNote":"DOI resolves to this exact title in Nature Human Behaviour (2018); abstract captured from the publisher landing page (nature.com). whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre (large-scale pre-registered replication, the Social Sciences Replication Project) confirmed.","doi_url":"https://doi.org/10.1038/s41562-018-0399-z","target_doi_url":null},{"id":"steele-aronson-stereotype-threat-reply-2004","title":"Stereotype Threat Does Not Live by Steele and Aronson (1995) Alone","authors":["Claude M. 
Steele","Joshua Aronson"],"venue":"American Psychologist","tier":"A","year":"2004","doi":"10.1037/0003-066X.59.1.47","critiqueType":"reply","target":{"title":"Sackett, Hardison & Cullen's critique of the stereotype-threat interpretation of the Black–White test-score gap","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges Sackett et al.'s critique by arguing that their extremely narrow focus on the reporting of a single experiment from the first stereotype-threat article greatly exaggerates three issues, which the comment then addresses in turn. It defends the broader stereotype-threat literature against the claim that the original experiment has been pervasively mischaracterized.","dimensions":["claims","overclaiming","generalisation"],"aiRelated":false,"field":"Psychology (social/educational psychology)","verifyNote":"DOI resolves to this exact title/citation in American Psychologist 59(1):47–48 (2004); abstract captured from APA PsycNet. whatItChallenges grounded in the abstract via the faithfulness ingestion gate (2026-06-22); genre (original authors' reply/rejoinder to the Sackett et al. critique) confirmed.","doi_url":"https://doi.org/10.1037/0003-066X.59.1.47","target_doi_url":null},{"id":"vyas-race-correction-clinical-algorithms-2020","title":"Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms","authors":["Darshali A. Vyas","Leo G. Eisenstein","David S. 
Jones"],"venue":"New England Journal of Medicine","tier":"S","year":"2020","doi":"10.1056/NEJMms2004740","critiqueType":"critical_commentary","target":{"title":"Diagnostic algorithms and clinical practice guidelines that apply race correction to their outputs","authors":[],"doi":null,"year":""},"whatItChallenges":"It challenges the embedded practice of race correction in clinical diagnostic algorithms and guidelines, arguing that adjusting outputs on the basis of race or ethnicity may steer more clinical attention or resources toward white patients than toward racial and ethnic minority patients.","dimensions":["methods","claims","generalisation"],"aiRelated":true,"field":"Medicine (clinical algorithms)","verifyNote":"DOI resolves to this exact title in N Engl J Med 383:874–882 (2020); abstract captured from the publisher landing page (nejm.org). whatItChallenges grounded in the (brief, single-sentence) abstract via the faithfulness ingestion gate (2026-06-22); genre (critical commentary on a class of race-corrected clinical algorithms) confirmed. Abstract characterizes a category of tools collectively rather than re-analyzing one named study.","doi_url":"https://doi.org/10.1056/NEJMms2004740","target_doi_url":null},{"id":"miguel-kremer-deworming-reply-2015","title":"Commentary: Deworming externalities and schooling impacts in Kenya: a comment on Aiken et al. (2015) and Davey et al. (2015)","authors":["Edward Miguel","Michael Kremer"],"venue":"International Journal of Epidemiology","tier":"A","year":"2015","doi":"10.1093/ije/dyv129","critiqueType":"reply","target":{"title":"Aiken et al. (2015) and Davey et al. 
(2015) re-analysis of the Miguel & Kremer (2004) Kenya deworming study","authors":[],"doi":null,"year":""},"whatItChallenges":"This is the original authors' reply to two re-analyses of their 2004 deworming study; they acknowledge the re-analyses corrected some errors but argue the updated results are extremely similar to the originals, with externality and school-participation effects remaining significant, so their key conclusion (that individually randomized studies underestimate deworming impacts) still holds. Rather than challenging a prior finding, it defends the original study by interpreting the re-analysis as confirmatory.","dimensions":["statistics","reproducibility","claims","methods"],"aiRelated":false,"field":"Epidemiology / development economics","verifyNote":"DOI resolves to this exact title in International Journal of Epidemiology 44(5):1593 (2015); published extract captured from the publisher landing page (academic.oup.com). whatItChallenges grounded in the extract via the faithfulness ingestion gate (2026-06-22); genre (original authors' reply in a published reanalysis exchange) confirmed.","doi_url":"https://doi.org/10.1093/ije/dyv129","target_doi_url":null},{"id":"barreca-saving-babies-vlbw-2011","title":"Saving Babies? Revisiting the Effect of Very Low Birth Weight Classification","authors":["Alan I. Barreca","Melanie Guldi","Jason M. Lindo","Glen R. 
Waddell"],"venue":"The Quarterly Journal of Economics","tier":"S","year":"2011","doi":"10.1093/qje/qjr042","critiqueType":"reanalysis","target":{"title":"Almond, Doyle, Kowalski & Williams (2010), \"Estimating Marginal Returns to Medical Care: Evidence from At-Risk Newborns\" (QJE) — its RD finding that 1-year infant mortality drops ~1pp as birth weight crosses the 1,500-g VLBW threshold","authors":[],"doi":null,"year":""},"whatItChallenges":"The paper challenges ADKW (2010)'s RD estimate that crossing the 1,500-g VLBW threshold reduces 1-year infant mortality by about one percentage point, showing that because the running variable exhibits extensive heaping at 1-oz and 100-g multiples, the point estimate is highly sensitive to dropping observations near the threshold. It concludes this sensitivity weakens confidence in the original, policy-relevant result.","dimensions":["identification","methods","statistics","reproducibility","claims","overclaiming"],"aiRelated":false,"field":"Economics (health economics / applied econometrics)","verifyNote":"DOI 10.1093/qje/qjr042 confirmed by the downloaded full-text PDF (QJE 126(4):2117–2123, 2011). whatItChallenges grounded in the article's abstract+intro (full text) via the faithfulness ingestion gate (2026-06-22); gate refined the proposed type comment → reanalysis (it re-estimates the original RD design). 
Source: user-supplied PDF download.","doi_url":"https://doi.org/10.1093/qje/qjr042","target_doi_url":null},{"id":"rouse-further-estimates-return-schooling-1999","title":"Further estimates of the economic return to schooling from a new sample of twins","authors":["Cecilia Elena Rouse"],"venue":"Economics of Education Review","tier":"B","year":"1999","doi":"10.1016/S0272-7757(98)00038-7","critiqueType":"reanalysis","target":{"title":"Ashenfelter & Krueger (1994), \"Estimates of the Economic Return to Schooling from a New Sample of Twins\" (AER)","authors":[],"doi":null,"year":""},"whatItChallenges":"Re-examining Ashenfelter & Krueger's twins estimates using three additional years of the same survey, the paper finds the within-twin return estimate is smaller than the cross-sectional one (implying a small upward bias in the cross-section), reversing Ashenfelter & Krueger's reported pattern — though their measurement-error-corrected estimates are statistically indistinguishable from these. It also finds evidence of an important individual-specific component to measurement error in schooling reports.","dimensions":["statistics","methods","identification","data_code"],"aiRelated":false,"field":"Economics (economics of education / labor economics)","verifyNote":"DOI 10.1016/S0272-7757(98)00038-7 confirmed by the downloaded full-text PDF (Economics of Education Review 18:149–157, 1999; PII S0272-7757(98)00038-7). whatItChallenges grounded in the article's abstract (full text) via the faithfulness ingestion gate (2026-06-22); genuine reanalysis of Ashenfelter & Krueger (1994). 
Source: user-supplied PDF download.","doi_url":"https://doi.org/10.1016/S0272-7757(98)00038-7","target_doi_url":null},{"id":"oster-missing-women-comment-das-gupta-2006","title":"On Explaining Asia's \"Missing Women\": Comment on Das Gupta","authors":["Emily Oster"],"venue":"Population and Development Review","tier":"A","year":"2006","doi":"10.1111/j.1728-4457.2006.00120.x","critiqueType":"reply","target":{"title":"Monica Das Gupta (2005) comment arguing the hepatitis-B explanation of Asia's 'missing women' is unlikely to be important","authors":[],"doi":null,"year":""},"whatItChallenges":"Oster replies to Das Gupta's objection that sex-ratio patterns over time and across families (girls faring worse under resource constraints, later births skewing male after earlier girls) point to cultural/son-preference explanations rather than hepatitis B. Oster defends the hepatitis B mechanism alongside cultural factors, distinguishing the working-paper and published versions of her own analysis.","dimensions":["claims","identification","generalisation","theory"],"aiRelated":false,"field":"Demography / development economics","verifyNote":"DOI 10.1111/j.1728-4457.2006.00120.x confirmed by the downloaded full-text PDF (Population and Development Review 32(2):323–327, 2006). whatItChallenges grounded in the article's opening (full text) via the faithfulness ingestion gate (2026-06-22); gate confirmed type=reply (Oster's reply to Das Gupta's comment, distinct from the critic's comment). Titled 'Comment on Das Gupta' but functionally Oster's reply defending her prior work. Source: user-supplied PDF download.","doi_url":"https://doi.org/10.1111/j.1728-4457.2006.00120.x","target_doi_url":null},{"id":"card-krueger-minimum-wage-reply-2000","title":"Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania: Reply","authors":["David Card","Alan B. 
Krueger"],"venue":"American Economic Review","tier":"S","year":"2000","doi":"10.1257/aer.90.5.1397","critiqueType":"reply","target":{"title":"Neumark & Wascher (2000), Comment on Card & Krueger (1994) — using EPI/payroll-record data to argue the 1992 NJ minimum-wage increase reduced fast-food employment","authors":["David Neumark","William Wascher"],"doi":"10.1257/aer.90.5.1362","year":"2000"},"whatItChallenges":"It responds to Neumark & Wascher's Comment, which attributed the contrary \"employment decline\" finding to flaws in Card & Krueger's telephone-survey employment data versus payroll records. Card & Krueger attempt to reconcile the two by analyzing administrative employment data from a new representative sample of NJ and PA fast-food employers, reanalyzing NW's data, and most importantly using BLS employer-reported ES-202 data to track a fixed longitudinal sample of major-chain establishments from 1992 to 1993.","dimensions":["data_code","methods","reproducibility","claims","statistics"],"aiRelated":false,"field":"Economics (labor economics)","verifyNote":"DOI 10.1257/aer.90.5.1397 confirmed by the downloaded full-text PDF (AER 90(5):1397–1420, Dec 2000; JSTOR sici 0002-8282…1397). whatItChallenges grounded in the article's full text (abstract+intro) via the faithfulness ingestion gate (2026-06-22); genre confirmed = reply (Card & Krueger's reply to the Neumark & Wascher (2000) Comment on their 1994 AER study). Target DOI links the in-corpus Neumark-Wascher Comment (id neumark-minimum-wages-employment) — an explicit Comment↔Reply exchange. Source: user-supplied PDF download (replaces the earlier wrong-file njmin-aer.pdf, which was the 1994 original).","doi_url":"https://doi.org/10.1257/aer.90.5.1397","target_doi_url":"https://doi.org/10.1257/aer.90.5.1362"}]}