Browse the critiques

Every per-paper critique, sliceable by the journal’s own structured metadata: the social-science domain of the target, the severity, the access basis on which the paper was read, its calibration verdict against the human-expert standard, and the AI/AGI theme. Search by title, author, or venue. Every facet is re-derived in-app.

63 of 63 critiques

needs review · Other / interdisciplinary · severity moderate · licensed access
Critique of “Multimodal large language models can make context-sensitive hate speech evaluations aligned with human judgement”
Thomas Davidson · Nature Human Behaviour · 2025-12-15
Both flaws survive in forms narrower than originally pled, at moderate severity. The empirical findings stand and the framework is auditable; the headline capability claim ('screening at scale, providing context-sensitive decisions') and the abstract-level 'closely aligned with human judgement' (the title says only 'aligned') require material qualification they do not currently carry - the former is format-bound against the paper's own deployment-like evidence, the latter is an uncriterioned gloss contradicted by the paper's own intervals on the most consequential attribute. These are overgeneralized verdicts atop sound, transparently reported measurements, not invalid data or analysis.
✓ calibrated · Psychology · severity moderate · open access
Critique of “On the conversational persuasiveness of GPT-4”
Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti et al. · Nature Human Behaviour · 2025
This is a carefully executed and transparently reported preregistered experiment that makes a genuine contribution by moving AI persuasion research from static text comparisons to live interactive debates. However, the headline framing substantially overstates the scope of the significant result, the mechanistic attribution to AI argument quality is unsupported by the design, and the single-item immediate outcome measure limits construct validity. These are bounded overclaims on an otherwise methodologically sound study.
needs review · Psychology · severity moderate · licensed access
Critique of “Cultural tendencies in generative AI”
Jackson G. Lu, Lesley Luyang Song, Lu Doris Zhang · Nature Human Behaviour · 2025
A well-executed descriptive study with converging evidence across measures and models, weakened primarily by labelling a model-output difference as 'real-world impact' without any human-outcome data, and secondarily by proposing a single causal mechanism that its language-manipulation design cannot isolate from confounds. These are bounded overclaims on an otherwise rigorous and transparent study that honestly discloses its sampling and model limitations.
needs review · Psychology · severity moderate · open access
Critique of “How human–AI feedback loops alter human perceptual, emotional and social judgements”
Moshe Glickman, Tali Sharot · Nature Human Behaviour · 2025
This is a well-executed experimental programme published in a top venue, with genuine strengths in design variety, statistical rigour, and transparent data sharing. The core finding — that interacting with a biased AI makes human judgements more biased in the short term, while interacting with an accurate AI improves them — is well supported across multiple paradigms. However, the paper's central framing as a 'feedback loop' and 'snowball effect' overclaims beyond what a single-pass design can demonstrate. The overclaim is consequential because it drives the paper's strongest policy implications and Discussion extrapolations. Secondary concerns about the Experiment 3 control design and absence of preregistration are genuine but bounded.
needs review · Psychology · severity moderate · licensed access
Critique of “Comparing the value of perceived human versus AI-generated empathy”
Matan Rubin, Joanna Z. Li, Federico Zimmerman et al. · Nature Human Behaviour · 2025
This is a well-powered, preregistered, multi-study investigation of an important and timely question. Its core descriptive finding — that labelling identical empathic responses as human versus AI shifts self-reported empathy ratings — is robust and well-replicated internally. However, the paper's main theoretical advance (the differential role of affective/motivational versus cognitive empathy) depends on a measurement instrument with poor CFA fit, and the conclusions section makes an unsupported inferential leap from perceptual labelling effects to claims about AI's capacity to feel and care. These are bounded overclaims on an otherwise sound body of work.
needs review · Political science · severity moderate · open access
Critique of “Positioning Political Texts with Large Language Models by Asking and Averaging”
Gaël Le Mens, Aina Gallego · Political Analysis · 2025
This is a clearly written research letter introducing a practical and promising approach to text scaling with LLMs. However, its validation has a construct-validity gap: the paper cannot fully demonstrate that LLMs position texts based on content rather than recognized-actor associations, because every test case involves well-known politicians — though the tweet-level validation partially addresses this. Secondary concerns include exclusive reliance on correlation without calibration metrics, small sample sizes in two of four tasks, and an abstract-level superiority claim over supervised classifiers that lacks formal statistical support. The authors' own caveats about generalisability are appropriate, but the headline claims — particularly about applicability to lesser-known actors and general superiority over supervised methods — outrun the evidence.
needs review · Communication & media · severity moderate · licensed access
Critique of “Real-time artificial intelligence sentiment feedback promotes self-moderation in contentious online discussion”
Soo Yun Shin, Seo Hyeong Kim, Dayeong Lee et al. · Journal of Computer-Mediated Communication · 2026
The paper’s within-condition dose-response findings (H3, H4) are internally sound, and RQ1 provides legitimate evidence that observers detect sentiment improvements. However, the two headline claims—that AI scoring causes more revision than general feedback (H1, p=.0498 without correction) and produces ‘cascading positive effects on third-party observers’ (RQ3 null, RQ4 exploratory with 1% variance)—both overstate the evidence. The overclaiming is bounded: the authors disclose Study 2’s exploratory status and the comparison confound in their limitations, but the abstract and conclusion do not reflect these caveats.
needs review · Political science · severity moderate · licensed access
Critique of “Reducing political polarization through conversations with artificial intelligence”
Timon M.J. Hruschka, Markus Appel · Journal of Computer-Mediated Communication · 2026
This is a transparently reported, well-powered study with genuine methodological strengths including preregistration, replication, and open materials. However, the abstract's unhedged claim that LLMs are 'powerful tools for individual depolarization' substantially outpaces the evidence, which consists solely of immediate post-conversation self-report shifts with no follow-up measurement. The affective-polarization measure departs from the standard partisan-group construct in ways that may inflate the apparent effect, and the CASA-based theoretical mechanism remains under-identified without a human-interlocutor comparison. These are bounded over-claims on an otherwise sound study.
needs review · Education · severity moderate · open access
Critique of “AI in education through the learners’ eyes: practical experience, perceptions, and challenges”
Kostadin Yotov, Silvia Gaftandzhieva, Emil Hadzhikolev et al. · Frontiers in Education · 2026
This is a competently executed descriptive survey whose primary vulnerability is systematic overclaiming in its interpretive passages: cross-sectional correlations are presented as ‘confirmed’ directional mechanisms and translated into institutional policy recommendations, contradicting the paper’s own methodological caveats. The ANN analyses compound this by deploying over-parameterised models without adequate validation reporting. These are bounded overclaims on an otherwise transparent study that honestly discloses its sampling limitations.
needs review · Political science · severity moderate · open access
Critique of “Understanding support for AI regulation: A Bayesian network perspective”
Andrea Cremaschi, Dae-Jin Lee, Manuele Leonelli · arXiv preprint · 2025
A methodologically transparent exploratory analysis demonstrating BNs as a useful tool for modeling complex survey belief structures. The primary vulnerability is that the formal variance-decomposition rankings, which carry the paper's headline claims, lack uncertainty quantification, meaning the precise orderings and magnitudes are not statistically grounded despite informal triangulation support. The causal language in the abstract is bounded overclaiming on an otherwise appropriately hedged paper.
needs review · Education · severity moderate · open access
Critique of “Whether and When Could Generative AI Improve College Student Learning Engagement?”
Fei Guo, Lanwen Zhang, Tianle Shi et al. · Behavioral Sciences · 2025
This is a well-scaled descriptive study that provides useful preliminary evidence on GenAI-engagement associations across learning contexts in Chinese higher education. The core analytical findings -- nuanced, context-dependent, and often mixed -- are informative. However, the causal language in the discussion ('has replaced,' 'Impacts') overstates what the cross-sectional design can establish, and the missing-flag method for structurally non-random missing data introduces unacknowledged bias risk. These are bounded overclaims on an otherwise competent survey study.
needs review · Political science · severity moderate · open access
Critique of “Political ideology shapes support for the use of AI in policy-making”
Tamar Gur, Boaz Hameiri, Yossi Maaravi · Frontiers in Artificial Intelligence · 2024
This is a competently executed exploratory survey study with commendable transparency practices. The causal framing in the title overclaims what cross-sectional data can establish, and the centrist–leftist merging weakens the theoretical contrast the paper relies on, but neither flaw undermines the descriptive findings. The study’s main contribution — documenting ideological differences in AI-governance attitudes during a political crisis — stands as exploratory evidence that warrants replication with stronger designs.
needs review · Political science · severity moderate · open access
Critique of “Artificial intelligence and social media as new arenas of political competition: challenges for democracy”
Ildar Kaliyev, Kargash Zhanpeiissova, Danagul Kopezhanova et al. · Frontiers in Political Science · 2026-05-29
This is a competently executed mixed-methods survey study that documents meaningful variation in how Kazakh social media users perceive algorithmic and AI-driven political communication. Its primary weakness is the gap between its correlational design and the causal framing in a key results passage. Secondary concerns about novel unvalidated instruments, a sample restricted to AI-aware users, and limited reproducibility are genuine but less severe.
✓ calibrated · Psychology · severity moderate · open access
Critique of “Mental health in the “era” of artificial intelligence: technostress and the perceived impact on anxiety and depressive disorders—an SEM analysis”
Daniela-Elena Lițan · Frontiers in Psychology · 2025
A competently executed exploratory survey that identifies a modest association between AI-related technostress and self-reported anxiety/depression symptoms. However, the paper’s value is substantially undermined by pervasive causal/predictive framing that its cross-sectional design cannot support, an incorrect claim that SEM fit indices mitigate common-method bias, use of a general-technology stress instrument without AI-specific revalidation evidence, and a power analysis computed for the wrong statistical technique. The disclosed limitations partially acknowledge the cross-sectional and scale-adaptation issues but do not restrain the overclaiming in the Discussion. The paper is best read as hypothesis-generating rather than confirmatory.
needs review · Psychology · severity moderate · open access
Critique of “AI-determined similarity increases likability and trustworthiness of human voices”
Oliver Jaggy, Stephan Schwan, Hauke S. Meyerhoff · PLOS ONE · 2025-03-05
This is a competent, transparent, preregistered study whose empirical contribution — that a lightweight d-vector cosine measure moderately tracks human voice-similarity judgments, including self-voice comparisons, and that self-similar voices attract slightly higher likability/trust ratings — is credible but small. Its one serious flaw is dimension-overclaiming: the title and abstract render a correlational, confound-exposed Experiment 5 in causal language ('increases'/'increased'), a claim the design cannot support and that the authors' own internal-validity admission undercuts. Because the body text uses the correct correlational register and the limitations are largely disclosed, the over-claim is bounded to the title/abstract on an otherwise-sound paper, making this moderate rather than high severity. Two secondary, span-grounded points — the aggregated-category-mean R² inflating apparent explained variance over a person-level rho of ~0.15-0.16 with a non-significant key term, and the same-gender/German-only sampling under-supporting the sweeping societal generalization — add coverage without carrying the headline.
needs review · Education · severity high · open access
Critique of “Exploring the acceptance of ChatGPT in higher education: a comprehensive quantitative study of university students and faculty”
Mehmet Haldun Kaya, Tufan Adıgüzel · Frontiers in Education · 2025-09-23
Publishable core with a serious, fixable framing defect. The PLS-SEM estimation and its validity diagnostics are sound and internally consistent, and the student-vs-faculty comparison is a real contribution. But the abstract mis-reports the paper's own primary result — labeling a non-significant path (effort expectancy, p = 0.08/0.69) as a "most significant predictor" while omitting the genuinely dominant construct (Habit) — which is the load-bearing finding readers will cite and is directly refuted by Table 10 and the Discussion. This should be corrected before the abstract is relied upon. The secondary flaws (an untested causal gloss on the ~31-person faculty subgroup, an over-reach on generalizability from a single-site convenience sample that contradicts the paper's own limitation, and a mismatch between the abstract's 378/346/32 headcount and the analyzed 351/320/31 sample) are moderate-to-minor and are largely matters of over-claim and inconsistent reporting rather than analytic error.
✓ calibrated · Communication & media · severity high · open access
Critique of “Investigating the impact of social media images on users' sentiments towards sociopolitical events based on deep artificial intelligence”
Nafiseh Jabbari Tofighi, Reda Alhajj · PLOS ONE · 2025-07-30
The paper reports a real, moderate-to-large descriptive association between hand-labeled image sentiment and comment positivity across four movements, and it is admirably transparent about its data, code, and several limitations. But its headline claim over-reaches on causal identification: a contemporaneous, same-post cross-sectional correlation cannot establish that images "strongly influence" user reactions when image and comments both respond to the same event, and the paper isolates no direction of effect. This is the single most defensible, hardest-to-refute flaw and it is not author-disclosed. Three secondary, span-grounded gaps compound it without being bundled into the headline: no significance test, p-value, or confidence interval is reported despite each coefficient resting on only n=20 posts (so the movement ranking is asserted, not shown); posts were selected partly on the sentiment dimension being correlated, distorting the coefficient relative to a random sample; and the "linear correlation" between a binary image score and a continuous comment percentage is really a point-biserial coefficient whose magnitude and cross-movement comparability are constrained by each movement's label split. The overall defect is a bounded but clear over-claim - causal and inferential language outrunning an associational, small, purposively selected design - on a paper whose finding of an association is otherwise real and whose limitations are partly disclosed.
needs review · Political science · severity moderate · user supplied
Critique of “AI meets politics: Examining the effects of different targeting strategies across 15 countries”
Sanne Kruikemeier, Svenja Schäfer, Alice Hamilton et al. · New Media & Society · 2026-06-04
A solid, honestly-reported study whose main conclusions hold up; the defensible critique is confined to a handful of inference/wording over-reaches rather than design flaws. The single most damaging item is the self-contradictory "reinforcing effect" sentence in the Results, which as printed asserts a finding the study's own null interactions and abstract refute. The EU-membership-on-political-targeting claim, the treatment of a p=.097 coefficient as confirming H1c, and the party-vs-content confound are real but each is partly disclosed or partly inherent to the design, so they are secondary. None rises to the level of overturning the paper's headline (small, mostly-null) findings. Overall severity: moderate.
✓ calibrated · Communication & media · severity moderate · user supplied
Critique of “Being literate, behaving literate? A mixed-methods approach to adolescents’ algorithm literacy and behavioral strategies on social media”
Larissa Leonhard, Ruth Wendt, Claudia Riesmeyer · New Media & Society · 2026-05-03
A solid, well-disclosed study whose conclusions are honestly hedged and whose data/code are shared. The genuine over-reaches that survive full-text refutation are presentational and threshold-consistency issues, not evidence that the analysis is wrong: (1) a load-bearing significance legend (*<.01) that contradicts the p-values (.028/.022/.047) of the very paths it marks — almost certainly an erratum but, as printed, capable of nullifying the headline result if read literally; and (2) calling the central SEM 'acceptable' against the authors' own stated SRMR<.05 rule that it (.055) and a CFA (.050) fail. A causal-language slip ('driving') is real but mitigated by the disclosed cross-sectional limitation. Because the effects are small and the flaws are reporting-consistency rather than design-invalidating, overall severity is low-to-moderate. The fair verdict is a calibrated 'mostly sound, fix the table legend and the fit-cutoff inconsistency.'
needs review · Communication & media · severity moderate · user supplied
Critique of “Beyond disruption and invisibility: Interactional continuity in everyday AI use in India”
Emilia Edwards, Dhiraj Murthy · New Media & Society · 2026-05-26
A methodologically careful, well-hedged qualitative study whose conclusions ("interactional continuity") are supported by its design as descriptive, tendency-level claims. The full text resolves most abstract-era worries: it openly states the sample is small, single-site, non-generalizable, reported as "tendencies and not robust subgroup claims," with no inter-coder reliability and no precision/recall claimed. The genuine residual over-reaches are narrow and quantitative: a span-exact denominator error (a "9% (n = 2)" figure that is internally inconsistent with the stated base of 28 "respondents"), a parallel embedded-GenAI percentage base that conflicts with the study's own Table 1 user count, an inferential "correlate more strongly than" proposition the descriptive design cannot test, and a "saturation" claim that overstates what a 2-day availability-based convenience sample establishes. None of these undermine the central qualitative argument; they are localized reporting/measurement over-reaches. Overall severity is moderate-to-low.
needs review · Political science · severity moderate · user supplied
Critique of “From rule of law to rule of algorithm: Generative Artificial Intelligence's threat to democracy”
A.T. Kingsmith · Big Data & Society · 2026-05-30
A moderate critique is the honest outcome. As a commentary the piece is solid, well-sourced in its conceptual claims, and genre-appropriate; the abstract-era "no data/no method" concerns are mostly resolved by the disclosed commentary genre and should be withdrawn. The one defensible empirical over-reach is the "over 60 countries" governmental-deployment claim, which upgrades a Microsoft marketing-blog availability figure into an adoption finding — a real but localized measurement/sourcing flaw. Secondary, weaker issues are the use of pre-generative predictive cases (Chinook, SyRI) to evidence a thesis premised on generative AI's qualitative novelty, and a strong causal 'specifically because' headline backed only by curated single cases. None of these threatens the paper's core argument; they are overstatements at the margins. Overall severity: moderate.
needs review · Communication & media · severity moderate · user supplied
Critique of “Into the black box: Laypeople's folk theories about generative artificial intelligence chatbots”
Li Z, Nuri Kim, L Chen · Big Data & Society · 2026-05-10
The full text resolves most abstract-era concerns: the high-literacy and non-generalizability worries are explicitly disclosed, and the exploratory inductive design is a legitimate fit for the research questions. What survives adversarial refutation is a genuine but bounded inferential over-reach — the paper foregrounds a light/heavy user comparison and makes subgroup-attributed belief claims that its saturation-only, uncounted, single-coder-initiated method cannot support, and this specific gap is not among the limitations it discloses. A secondary, low-severity sourcing issue is that the motivating scale statistics rest on a marketing blog. Neither flaw undermines the study's central qualitative contribution, which is why the overall severity is moderate rather than high. The honest outcome is a calibrated, not a damning, critique.
needs review · Sociology · severity moderate · user supplied
Critique of “Making GenAI valuable: Benchmarks, singularities, and the enrichment economy”
Claudia Aradau, Tobias Blanke · Big Data & Society · 2026-05-20
The full text resolves most concerns an abstract-only critique would raise: the method is disclosed, the framework is offered as supplementary rather than totalising, and the paper is unusually careful to hedge its strongest-sounding claims. What survives adversarial refutation is one moderate flaw — the valuation thesis's hinge (investors use benchmarks to value firms / investors are the "primary audience") is an empirical claim about an actor the paper never observes and never cites — plus a low-severity citation inconsistency (FERET "1993" vs "NIST, 1983"). Neither sinks the contribution; the first means the causal/valuation framing is asserted rather than demonstrated and should be read as an interpretive hypothesis, not an established mechanism. Calibrated, honest severity: moderate-leaning-low.
✓ calibrated · Management, IS & marketing · severity moderate · user supplied
Critique of “More Versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review”
Claudine Madras Gartenberg, Sharique Hasan, Alex Murray et al. · Organization Science · 2026-04-27
This is a careful, unusually self-aware descriptive paper that mostly stays within its evidence and pre-discloses its biggest weaknesses (detector limits, author-fixed-effects confounding, the contested meaning of "writing quality"). The one place where the rhetoric genuinely outruns the design is the causal attribution of the submission surge: the abstract/intro assert the volume rise is "primarily due to AI use, not organic growth," but the supporting evidence is a compositional band-decomposition that cannot, even in principle, separate authors migrating from "human" to "AI" detector bands from AI generating genuinely additional submissions — and the paper's sole identification lever (the UTD-Responder DiD) is self-described as noisy and loses robustness on total volume in Panel B. That single over-reach is real and survives full-text refutation; most other abstract-era worries are resolved by the paper's own disclosures. Overall severity is moderate, concentrated in one causal claim, not pervasive.
needs review · Management, IS & marketing · severity moderate · user supplied
Critique of “The Cybernetic Teammate: A Field Experiment on Generative AI and Teamwork”
Fabrizio Dell’Acqua, Charles Ayoubi, Hila Lifshitz‐Assaf et al. · Organization Science · 2026-06-12
A methodologically strong, transparently reported, preregistered field experiment whose conclusions are mostly well-supported. The one over-reach that survives adversarial full-text refutation is inferential: the flagship 'AI matched teams' claim is an equivalence statement asserted without an equivalence test or the relevant pairwise comparison, and the point estimates do not actually favor equivalence. Three secondary, lower-severity issues (abstract elevation of an exploratory emotion result, demand/novelty exposure of non-blind self-reported affect, and an untested human-selection-advantage gap) are real but largely acknowledged in the body. Net: a moderate critique honestly stated — the paper is good, and its main vulnerability is rhetorical precision around equivalence rather than design or data integrity.
needs review · Political science · severity moderate · user supplied
Critique of “The rise of AI sovereignty: Authoritarian technological imaginaries as a form of reflexive control”
Gregory Asmolov · Big Data & Society · 2026-05-26
This is an honest, well-hedged conceptual Commentary, not an over-claiming empirical study, and most abstract-era worries (small N, quote provenance, "not a leading AI power") are resolved by the genre label, the Supplemental source list, and the author's own self-aware framing. One real over-reach survives full-text refutation: the paper explicitly downgrades reflexive control to "a heuristic device... rather than a direct causal mechanism" and concedes intent is empirically "difficult" to establish, yet the Analysis and Conclusion then assert the downstream cross-regime EFFECT as accomplished fact: that authoritarian actors "shape environments so that even democracies adopt authoritarian logics as rational responses to risk." That effect on non-authoritarian policymaking is the paper's headline contribution, but the design (qualitative discourse analysis of one leader's statements) measures only the SENDER's rhetoric; no democratic adoption, and no causal pathway from Putin's statements to any non-authoritarian outcome, is observed (the BRICS/EU "parallels" are explicitly co-occurrence, not demonstrated influence). The hedge does not inoculate the claim; it is precisely what exposes the conclusion as an internal contradiction. Severity is moderate: the contribution stands as an interpretive lens, but its strongest sentences claim more than a single-actor discourse analysis can license.
✓ calibrated · Economics & finance · severity moderate · open access
Critique of “Large Language Models, Small Labor Market Effects”
Anders Humlum, Emilie Vestergaard · Becker Friedman Institute Working Paper (University of Chicago) · 2025
A methodologically strong, transparently caveated paper whose headline null on register-measured earnings and hours is well-identified and largely robust. The defensible weaknesses are concentrated in how supporting magnitudes are labeled, not in the null: the abstract calls a coarse self-reported time-savings perception a "productivity gain"; the 3-7% pass-through is a slope between two bracketed self-reports in which ~97% report no earnings change, yet is used to argue the pass-through sits within standard literature estimates; and the abstract binds the tightest pooled CI bound (1%) to an "in any occupation" claim whose own occupation-level bound is 6%. None of these overturn the central finding; they qualify its precision and the authority of its mechanism story. I dropped the original 'circularity' sub-claim and the 'CI stated inconsistently across the paper' framing because the full text refutes them (the second credibility fact anchors to administrative DiD estimates, not the same instrument; and the body openly reports all three bounds as distinct estimands). Overall severity: moderate.
needs review · Education · severity high · open access
Critique of “AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting”
Gregory Kestin, Kelly Miller, Anna Klales et al. · Scientific Reports (Nature Portfolio) · 2025
A rigorous, genuinely strong within-subject RCT whose learning-gain finding is well supported (overwhelming raw effect, identical content across conditions, independent test construction, subgroup + clustering robustness, public data), but whose headline ATTRIBUTION and SCOPE outrun the design. The realized contrast bundles the AI tutor with at-home/solo/pre-recorded delivery against an in-class/peer/live control, so attributing the gain specifically to AI personalization (title 'outperforms in-class active learning'; 'largely due to its ability to offer personalized feedback on demand') is not identified, and the medium difference is dismissed by assertion rather than tested; the abstract's 'compelling case for its broad adoption' generalizes two lower-Bloom's physics lessons in one elite course well beyond what is shown. A low-severity caveat: the ceiling-corrected 0.73–1.3 SD band is presented with more certainty than its undocumented derivation supports (though the raw effect is independently strong). Severity high, concentrated in the AI-specific causal attribution and the adoption overclaim, not in the existence of the effect. Procedural note: produced by the journal's autonomous production cycle (G105) and run through the hardened convergence gate (survives-majority, stable); the panel restored two low-severity caveats, and one draft flaw (unequal analyzed Ns) was dropped as a benign crossover split (142+174=316; data public). Every span was independently verified as an exact substring of the gold-OA full text; the critique targets claims, methods and inference only, never the authors.
needs review · Education · severity high · open access
Critique of “Factors influencing the adoption of generative artificial intelligence into classroom teaching by university teachers: An empirical study using SPSS PROCESS macros”
Yong Xiang, Chenxin Yang, Zhigang Jin et al. · PLOS One · 2025
A publishable-genre but methodologically weak adoption-intention study whose conclusions should be read as exploratory correlational associations, not the causal mechanisms it claims. A full-text convergence panel returned a unanimous 'survives' verdict (the defender could not restore any point). Four span-exact flaws hold: causal overreach from a one-month cross-sectional all-self-report design; a sample whose 36–49 age description contradicts its own 22–45 eligibility rule; reproducibility limits (withheld raw data, a double-assigned citation [24], a garbled results statistic); and an uncited universal-negative novelty claim the paper's own citations undercut. The genuine strengths (standard PLS reliability/validity reporting, 5,000-resample bootstrap CIs, honest disclosure of self-report and generalizability limits) are real but bear on measurement quality and cannot offset the reproducibility and causal-inference problems. Overall severity high, driven primarily by the reproducibility/verifiability issues and the causal overclaiming rather than any single fatal statistical error. Procedural note: produced by the autonomous production cycle (G101); every span was independently verified as an exact substring of the gold-OA full text; the critique targets claims, methods and inference only, never the authors.
✓ calibrated · Economics & finance · severity moderate · open access
Critique of “The (Short-Term) Effects of Large Language Models on Unemployment and Earnings”
Danqing Chen, Carina Kane, Austin Kozlowski et al. · arXiv (econ.GN) preprint · 2025
Suggestive early evidence whose headline magnitudes should be read cautiously. After an adversarial convergence panel that restored the identification flaw to a framing point and tempered the rest, three calibrated concerns remain: an uncorrected, direction-specific CPS top-code redefinition in the post period that SDiD attenuates but does not eliminate (the standout, moderate); a substantive unemployment null interpreted without an equivalence/precision argument (moderate); and unweighted occupation means without a robustness check (low). The paper earns real credit for transparency about pre-trends, the ITT nature of exposure, its estimator choice, and its reported SEs/CIs. Net severity moderate (softened from an initial 'high' after the panel showed the estimator structure and the explicit relative estimand): the core contribution stands as suggestive, but the $89 magnitude and the 'no employment effect' conclusion are the parts to hold loosely until the top-code is harmonized and the null is backed by a precision argument. Procedural note: produced by the autonomous production cycle (G101) and span-verified against the OA full text; the identification flaw was softened and overall severity calibrated down per the convergence panel before publication.
needs review · Public policy & criminology · severity moderate · open access
Critique of “Heterogeneous preferences and asymmetric insights for AI use among welfare claimants and non-claimants”
Mengchen Dong, Jean-François Bonnefon, Iyad Rahwan · Nature Communications · 2025
A solid, transparent, and largely credible contribution whose descriptive core (claimant AI aversion; asymmetric perspective-taking) is well supported by convergent multi-study evidence, strong preregistration, and open materials. Its credibility is dented but not overturned by (1) an underpowered, wrong-unit statistical test (t(4), implausible d's) underpinning the conjoint replication of the headline effect, which is corroborating, not load-bearing; (2) a directional misstatement in one lead-study summary sentence; (3) a normative/policy claim that reaches beyond hypothetical stated-preference data; and (4) a 'representative' sample that is representative only on demographics, not on the welfare-experience dimensions the policy argument centres on. Severity moderate: the flaws are real and worth surfacing, but the paper's transparency and the participant-level replication of its headlines keep them scope-limiting rather than fatal. Procedural note: this critique was produced by the journal's autonomous production cycle (G99) and sharpened after a convergence panel flagged that the conjoint-test flaw must not be framed as load-bearing; that framing was softened before publication, and every span was independently verified as an exact substring of the gold-OA full text.
✓ calibrated · Management, IS & marketing · severity moderate · open access
Critique of “Effect of AI empathy perception on employees' prosocial behavior: mediating role of warmth and moderating role of AI anthropomorphism”
Jing Xue, Yang Liu, Zhihong Ren et al. · Frontiers in Psychology · 2025
A competently executed but somewhat oversold study. The three-wave matched design, acceptable reliabilities, and bootstrap-based mediation/moderated-mediation tests make the within-model statistical story internally coherent, and the authors disclose real limitations honestly. The retained concerns are genuine but more moderate than the draft implied once restricted to spans actually present in the provided sections: the moderator is measured as physical/movement anthropomorphism rather than the mind/emotion anthropomorphism the theory needs (the strongest, fully-grounded flaw); the sample accounting (465 valid -> 400, '92.45%' as a mislabeled retention rate, 100.1% gender sum) is internally inconsistent and unverifiable without shared data; and the abstract's directional causal framing ('can enhance', anthropomorphism moderating the effect on prosocial behavior) runs ahead of an indirect-effect-centered, all-self-report correlational design. The contribution is worth taking seriously as correlational evidence within Chinese digitalized firms, but the causal and moderation claims should be softened to the mediated pathway actually evidenced, the anthropomorphism construct re-specified or re-justified, and the sample numbers reconciled. Overall severity: moderate.
needs review · Economics & finance · severity moderate · open access
Critique of “AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights”
Jiannan Xu, Gujie Li, Jane Yi Jiang · arXiv (working paper) · 2025
A methodologically careful and genuinely novel demonstration of a real pattern — LLMs systematically prefer their own stylistic output over human and rival-model summaries — that nonetheless overclaims on two main fronts: its most dramatic figure (100% equal-opportunity bias) rests on 30 annotated pairs with three raters per condition, and simulation-derived shortlisting advantages (23-60%) from a forced-choice pipeline are presented as real labor-market impact. A construct-validity concern about the LiveCareer "human-written" baseline further weakens the 'against human-written resumes is particularly substantial' claim. Reproducibility is weak: no temperature/seed/decoding settings, no stated code or data release, and no ethics/compensation disclosure for the Prolific annotators in the provided text. The robust core claim (own-style preference across many models) is well supported; the magnitude and real-world-impact claims are weak-to-moderately supported and should be read as upper bounds from a forced-choice design rather than field estimates. The self-recognition mechanism is plausibly but not cleanly identified, since source and style co-vary.
needs review · Political science · severity moderate · open access
Critique of “Local US officials' views on the impacts and governance of AI: Evidence from 2022 and 2023 survey waves”
Sophia Hatz, Noemi Dreksler, Kevin Wei et al. · PLOS One · 2025
A methodologically careful, transparency-forward descriptive survey whose value lies in mapping local US officials' AI attitudes, undermined chiefly by interpretive overreach: the headline partisan-convergence-after-ChatGPT narrative is foregrounded in the abstract despite resting on an unweighted descriptive cross-tab and an independent-cross-section design that cannot identify change or its cause. The descriptive results (risk anticipations, 64% regulation support, specific-policy majorities) are trustworthy; the temporal, causal, and partisan-shift claims should be read as suggestive at best. Severity is moderate because the authors disclose most design weaknesses candidly in a strong Limitations section and the under-supported claims are largely confined to abstract framing rather than the proportions themselves, but the mismatch between what the abstract emphasizes and what two cross-sections support is real and would mislead a casual reader. Note on the draft under review: two of its four proposed flaws cited verbatim spans (the 'interaction between year and party is not statistically significant' sentence and the 'design was not a longitudinal panel' sentence) that do not appear in the provided Methods/Results/Limitations/Abstract; those spans were not groundable, so the affected critiques were re-anchored to verifiable abstract spans and the unverifiable non-significance assertion was removed. The measurement/IDK flaw was dropped as author-disclosed and dependent on ungroundable specifics (26% IDK, '21 additional significant coefficients'), leaving three defensible, span-exact flaws.
needs review · Education · severity moderate · open access
Critique of “Postgraduate students' perceptions of artificial intelligence integration in research: A cross-sectional study”
Ibrahim Naif Alenezi, Fathia Ahmed Mersal, Amal Ahmed Elbilgahy · PLOS One · 2026
REFUTE (with credit), but on a much narrower and better-grounded basis than the draft. The paper is a competent, self-aware single-institution descriptive survey that explicitly disclaims causal inference, names its convenience-sampling bias and likely direction, and restricts its claims to a context-specific case study. Three genuine, exactly span-grounded flaws nonetheless weaken its inferential and evidentiary core: (1) the abstract/conclusion recast a positive cross-sectional privacy-concerns coefficient as a psychological disposition ("critical literacy rather than barriers to adoption") and label simultaneously-measured self-report subscales as "the strongest predictor," language stronger than a single-time-point common-method design supports; (2) an unresolved internal contradiction — prior AI experience is an explicit eligibility requirement, yet prior AI use appears as a Yes-vs-No regression predictor and only 85% report prior use, so either ineligible respondents were included or the filter was not applied; and (3) the analysis is not independently reproducible because the Data Availability Statement declares only aggregate manuscript tables exist. These are real but bounded; treat the descriptive prevalence findings as suggestive context-specific evidence and the interpretive "critical literacy"/"pragmatic optimism" framings as overclaimed. IMPORTANT: the draft critique was heavily contaminated with material absent from the full text (a Model 1/Model 2 split, adjusted R²=0.993, "methodological circularity," literal "40 participants (15.0%)"/"Table 1," subscale r up to .71, a 97.4%/227 ChatGPT denominator) and with at least one false claim ("no confidence intervals reported anywhere" — Table 4 includes a 95% CI column). Those points were dropped as ungrounded.
✓ calibrated · Public policy & criminology · severity high · open access
Critique of “Fairness Is More Than Algorithms: Racial Disparities in Time-to-Recidivism”
Jessy Xinyi Han, Kristjan Greenewald, Devavrat Shah · arXiv (cs.CY; stat.AP) preprint · 2025
A worthwhile conceptual contribution — treating recidivism as time-to-event and offering a survival-based falsification test for the role of non-algorithmic factors — whose empirical demonstration, as presented, is too underpowered, under-reported, and assumption-dependent to support the inferences attached to it. The single uncorrected low-risk log-rank result, reported without sample sizes, effect sizes, confidence bands, or multiplicity correction, cannot bear 'statistically significant disparities emerge', and the seven-month threshold reads as data-selected. On identification the paper is more careful than a first pass suggests — the exclusion restriction is the openly-tested null and miscalibration is discussed — so that flaw is moderate, not high: the residual problem is that the test cannot separate residual algorithmic miscalibration from socioeconomic context, so the specific attribution to structural factors over-reaches (though hedged). The undefended non-informative-censoring assumption on plausibly SES- and race-correlated custody returns admits an informative-differential-censoring rival that would reproduce the headline. Net: high concerns concentrated in statistical inference and measurement, moderate in identification; the empirical claims should be read as a tentative illustration pending a corrected, fully reported, assumption-tested re-analysis. Procedural note: one verbatim span was re-anchored to a clean substring after an arXiv-HTML LaTeX percent artifact; the identification claim was softened from high to moderate after a defender lens confirmed the paper's explicit calibration discussion and openly-stated null; a reproducibility flaw was dropped as non-groundable from the retrieved text and double-counted against statistical inference.
needs review · Psychology · severity moderate · open access
Critique of “When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment”
David Almog, Lucas Lippman, Daniel Martin · arXiv (working paper) · 2026
A competent, transparent, pre-registered experiment whose causal infrastructure (randomization, dual grading of every caption, leniency-neutralizing incentives, individual-clustered SEs with image fixed effects) is solid, and whose quantity result and cost-of-grading comparison are credible. The headline 'AI assessment lowers work quality' claim, however, is overstated. It is fragile in two decisive and tightly linked ways: it is obtained only by conditioning on output quantity, a mediator the treatment itself moves (a bad-control problem that makes the length-adjusted contrast something other than the total causal effect), and it reverses on the single model-free benchmark — raw human grades favor the AI treatment, significantly (5.07 vs 4.97, p=0.0046). The entanglement is mechanical: length is simultaneously the quantity outcome and the dominant driver of and control for the grades, so the 'more quantity, less quality' narrative is partly an artifact of how length enters both measures. These are flaws of estimand choice, identification, and emphasis rather than fabrication or gross analytic malpractice, and the authors' own robustness checks and open reporting of the divergent human grades cut against the harshest reading — hence overall moderate severity. But the two high-severity points are decisive enough that the quality headline should be read as control-dependent, not as a general property of AI assessment.
needs review · Education · severity moderate · abstract only
Critique of “Is it harmful or helpful? Examining the causes and consequences of generative AI usage among university students”
Muhammad Abbas, Farooq Ahmed Jam, Tariq Iqbal Khan · International Journal of Educational Technology in Higher Education · 2024-02-16
This is a competently structured early-stage survey study whose two-sample, scale-then-test design and three-wave time-lagged data collection are genuine strengths relative to typical cross-sectional work in this space. Its core empirical contribution — a validated eight-item ChatGPT-usage scale plus a map of plausible antecedents and correlates — is credible. The principal weakness is interpretive: the abstract frames correlational, largely self-report findings in causal/change language ("develop tendencies for procrastination and memory loss and dampen the students' academic performance"), where confounding, selection, and common-method variance remain live alternative explanations the design cannot exclude. The "memory loss" outcome is the most overreaching, given no indication of an objective measure. Treated as associational and exploratory, the claims are reasonable; treated as evidence that ChatGPT use harms cognition and grades, they are not yet warranted. Confidence is medium because this judgment rests on the abstract alone, which may omit robustness checks, controls, or measure details present in the full text.
needs review · Psychology · severity high · open access
Critique of “Inconsistent advice by ChatGPT influences decision making in various areas”
Shinnosuke Ikeda · Scientific Reports · 2024
Empirical and on-topic, with real strengths in transparency (OSF data and materials, stated ethics approval and consent) and candor about its null results — but the inferential backbone is weak in three precisely-grounded ways that survive refute-by-default scrutiny. (1) The headline causal claim depends on a between-study comparison the author admits is confounded by data-collection context, while the within-subject counterfactual that could have identified the effect cleanly is sidelined. (2) Two residual-analysis significance statements are printed with '(ps > 0.05)' — i.e., as written they contradict their own conclusions, and the inconsistency is confirmed by a parallel '(ps < 0.01)' sentence elsewhere, so this is material, not cosmetic. (3) One of the two key models does not fit better than null (p = 0.471) yet a coefficient inside it is interpreted, and both moderation effect sizes are negligible (McFadden R2 0.030 / 0.007). The remaining five draft dimensions over-fired: disclosed-limitations, sample_data, and reproducibility largely re-charge the same underlying defects already counted under identification and statistical inference (double-counting), while measurement and generalizability concerns are either author-disclosed or not cleanly span-groundable. The work is best read as exploratory and suggestive; correcting the residual-analysis reporting, identifying the advice effect from the within-subject counterfactual, and downgrading the causal/'various areas' language in the title and abstract would be needed before the central claims are credible.
needs review · Education · severity moderate · abstract only
Critique of “Student perspectives on the use of generative artificial intelligence technologies in higher education”
Heather Johnston, Rebecca Wells, Elizabeth M. Shanks et al. · International Journal for Educational Integrity · 2024-02-08
A competent, large-sample descriptive survey that achieves its stated applied aim of informing one university's academic-integrity policy, and that is honest about being exactly that. Read against the abstract it does not over-claim: its credibility is strongest for the headline proportions, and its limits are matters of scope and construct precision rather than over-reach. The confidence-and-usage relationship is reported as a descriptive association and is best read narrowly, since the abstract identifies no mechanism and reports no adjustment; the combined 'used or considered using' figure mixes behaviour with intention and so is not a clean use rate; and the normative recommendations (no ban, equal access) are reasonable policy positions that extend beyond what a descriptive survey can by itself establish. None of these are integrity or competence concerns; they are calibration and inferential-scope limits typical of practitioner survey work. Severity is capped at moderate given abstract-only access.
needs review · Sociology · severity moderate · abstract only
Critique of “Cultural bias and cultural alignment of large language models”
Yan Tao, Olga Viberg, Ryan S. Baker et al. · PNAS Nexus · 2024-09-01
This is a credible, well-scoped contribution to LLM cultural-bias evaluation whose claims are mostly calibrated to what the described design can support. The central demonstrated result — that five OpenAI models lean toward English-speaking/Protestant-European values, and that cultural prompting raises a survey-benchmarked alignment metric for 71-81% of countries on later models — is concrete and falsifiable. The main reservations are not flaws in the conclusions so much as gaps the abstract leaves open: the unnamed "cultural values" construct and similarity metric, the absence of any effect-size for the reported alignment gains, the unquantified 19-29% non-improving countries and older models, and the risk that "alignment" to national survey means rewards stereotyping over within-country diversity. The motivational claim about biasing "authentic expression" is hedged appropriately but is not tested by the output-vs-survey design. Overclaiming is minor; the paper's framing is largely honest.
needs review · Psychology · severity moderate · abstract only
Critique of “Generative AI enhances individual creativity but reduces the collective diversity of novel content”
Anil R. Doshi, Oliver Hauser · Science Advances · 2024-07-12
A well-designed causal experiment whose individual-level claims are solidly licensed by randomization but whose marquee collective-diversity conclusion is more fragile than the framing suggests. The two principal threats, neither resolvable from the abstract, are (1) construct slippage — equating evaluator-rated creativity with objective individual creativity — and (2) external validity — the homogenization effect may be partly an artifact of a single shared LLM and a fixed idea pool, rather than a generalizable property of generative-AI-assisted writing. The authors' own hedging ("point to," "at the risk of," "resembles a social dilemma") is appropriately calibrated and keeps the overclaiming in the minor-to-moderate range. The finding is interesting and policy-relevant; the chief caution is against reading a one-experiment, one-model similarity result as a settled fact about AI narrowing human cultural output.
needs review · Psychology · severity moderate · abstract only
Critique of “Testing theory of mind in large language models and humans”
James W. A. Strachan, Dalila Albergo, Giulia Borghini et al. · Nature Human Behaviour · 2024-05-20
Within abstract-only limits, this reads as a careful, well-designed behavioral comparison whose framing conclusion is appropriately hedged and whose multi-construct, multi-model, repeated-testing, large-human-sample design is a genuine strength. The principal calibrated concern is not the headline claim but the two mid-abstract mechanistic interpretations — that GPT's failures are "hyperconservative" response bias rather than absent inference, and that LLaMA2's faux-pas edge was "illusory" — which assert separable competence-from-style accounts that are harder to license for alignment-tuned text models than for humans, and which the abstract does not back with reported effect sizes or inferential tests. Construct validity (do human-normed ToM instruments measure the same latent ability in an LLM?) is the deeper unresolved issue, but the abstract's own emphasis on "non-superficial comparison" suggests the authors share it. Net: credible, modest, and self-aware in its top-line claim; somewhat overreaching in its causal-mechanistic sub-claims.
needs review · Sociology · severity moderate · abstract only
Critique of “Algorithmic responsibility in PPC practice: Interpreting black boxes in digital advertising work”
Natalia Chrobak · Big Data & Society · 2026-05-20
Within its genre as an interpretive, concept-building contribution to critical algorithm studies, the paper rests on a reasonable mixed-method corpus and frames its central concept with appropriate modesty. The abstract-level concerns are about the fit between evidence and summary, not about the fieldwork itself: 'transformation' and 'increasingly' assert change over time without a stated baseline, on the critic's reading; some interpretive attributions (appearance of control, 'embodied' and emotional knowledge) are presented as findings; and the closing 'indispensable interpreters... in a society dominated by data' extrapolates from one occupation to society at large with a counterfactual-necessity word the design cannot test. None of these undermine the contribution's plausibility, but they invite more hedged phrasing. Severity is capped at moderate given abstract-only access. Net: a credible, genre-appropriate study whose headline language is somewhat stronger and broader than the stated synchronic, single-occupation evidence licenses.
needs review · Communication & media · severity moderate · abstract only
Critique of “Charismatic machines: On the epistemic power of generative AI within platform convergence”
Mauro Barisione · New Media & Society · 2026-04-29
As a conceptual, theory-building contribution the abstract is coherent, clearly defined, and theoretically grounded, and it largely uses genre-appropriate verbs ("develops," "introduce," "propose") and at least one explicit hedge ("raises fundamental questions"). The fair, abstract-bounded concerns are that (1) load-bearing premises (that AI acquires authority "not through actual understanding" and is subject to a "dual misrecognition") are stipulated rather than argued; (2) the characterization "structurally unstable" appears, on the critic's reading, stronger than its attribution-contingent mechanism; and (3) the closing triad of governance, democracy, and epistemic inequalities is broad relative to any mechanism the abstract states. None of these is disqualifying for a concept paper; they mark where the full text must supply criteria, scope conditions, and disconfirming cases. Severity is capped at moderate given abstract-only access.
needs review · Sociology · severity moderate · abstract only
Critique of “Crafting computer vision through human eyes: An AI laboratory ethnography”
Luqing Zhou · Big Data & Society · 2026-05-22
A candid, genre-appropriate interpretive ethnography whose conceptual contributions (a three-source typology and a "sensory, interactive, and processual" reframing) are reasonably grounded in nine months of single-site fieldwork. The principal, moderate concern is scope: on the critic's reading, the move from one computer-vision laboratory to "machine learning" generally and to "the epistemological foundations of AI" outruns the stated vision-specific evidence, though the abstract's hedged verbs partly contain this. Severity is capped at moderate given abstract-only access.
needs review · Communication & media · severity moderate · abstract only
Critique of “From prompt engineering to prompt design: Research strategies for visual generative AI”
Gabriele Colombo, Sabine Niederer, Carlo De Gaetano · Big Data & Society · 2026-05-19
A clear and largely candid conceptual demo whose principal contribution, a reusable five-part prompting-strategy typology grounded in the query design framework, is genuine and appropriately framed. The critique is moderate and centers on scope discipline rather than method legitimacy: the empirical descriptors lean on strong verbs ("reveal," persistence over time) that a single-concept case (biodiversity) and a short 2023-2024 window do not fully license, and the programmatic claim that the method "reshapes generative AI systems" appears, on the critic's reading, to outrun the described activity of prompting and interpreting outputs. The LLM-as-analyst step is honestly hedged but leaves an unaddressed shared-bias circularity risk. Severity is capped at moderate given abstract-only access and the work's self-described demo genre; read as an agenda-setting methods proposal rather than an evidential study, it is a reasonable contribution whose headline claims should be treated as illustrative and aspirational.
✓ calibrated · Communication & media · severity moderate · abstract only
Critique of “Resilience and disempowerment in algorithmic systems”
Samantha M. Jones, Erin A. Heerey · New Media & Society · 2026-05-19
Judged on the abstract alone, this is a candid, appropriately hedged mixed-methods experimental study with a useful disclosed sample (N = 263) and a clearly stated genre. Its central behavioral contrast (more homogeneous selections under an adaptive algorithm versus a diversity-maintaining one) is its most load-bearing claim, and on the critic's reading its main vulnerability is whether the result reflects user behavior given an equivalent menu or simply the differing menus the two conditions presented; the abstract does not resolve this. The qualitative themes are reported without prevalence or coding detail, and the closing "individual-level differences" framing is not tied to a stated moderator. These are scope-and-detail concerns appropriate to abstract-only review, not signs of tonal overreach. Severity is capped at moderate given abstract-only access; the writing's hedging is a genuine strength.
needs review · Sociology · severity moderate · abstract only
Critique of “Working the algorithm: Contextual skills of on-demand gig workers”
Xinyi Hong, Xinyi Cheng, Dong Liu · Big Data & Society · 2026-05-15
A genre-appropriate, candid interpretive contribution whose conceptual core, that gig workers cultivate practical algorithmic skills, is reasonably grounded in its stated interviews and hedged as an argument. The main weaknesses are reach beyond the evidence rather than internal error: a nine-indicator taxonomy and a normative claim of being "more constructive and sustainable" than the control-resistance framework "for the future of human-algorithm collaboration in workplace contexts" both extend further than 20 interviews can support, and "meaningful agency" sits in tension with workers only "occasionally" challenging control. The abstract's own call to "contextualize algorithmic skills within specific sociotechnical and occupational frameworks" partly offsets the generalization concern. Severity is capped at moderate (abstract-only); the contribution to AI/AGI capability research specifically is minimal, as this is a labor-studies/sociotechnical analysis of how workers cope with algorithmic management.
needs review · Management, IS & marketing · severity moderate · abstract only
Critique of “Backfiring AI? AI Deployment in Workplace”
Di Yuan, Manmohan Aseri, Narayan Ramasubbu · Management Science · 2026-05-04
This is a legitimate and genuinely interesting formal contribution: it isolates a clean, counterintuitive mechanism by which AI-facilitated knowledge transfer 'can disincentivize high-performing employees and ultimately backfire,' and it frames its results with appropriate conditionality. Its real limits are about assumption-dependence and untested external validity, not integrity. The backfire's generality versus knife-edge status is unverifiable from the abstract; the result leans on stylised, undefended choices (binary hard/soft skills, a competitive contest, outcome-based pay) whose load-bearing role is acknowledged but not stress-tested; there is no empirical calibration or validation; and the policy prescriptions, especially deliberately choosing a non-maximal 'optimal AI efficacy,' are asserted more firmly than the hedged existence claims warrant. Under abstract-only access the proofs, equilibrium conditions, and comparative-static signs are unseen, capping auditability. Taken together these are ordinary scope and robustness concerns for a stylised theory paper that should be read as a mechanism-generating possibility result rather than an established workplace regularity. Severity: moderate.
needs review · Management, IS & marketing · severity moderate · abstract only
Critique of “How Costs Influence Preferences for Control in Generative Artificial Intelligence (GenAI): Human-Guided vs. GenAI-Based Delegated Search”
Lei Wang, Ho Cheung Brian Lee · Information Systems Research · 2026-04-30
A genuine, large-scale empirical contribution whose descriptive core — that cost salience co-occurs with a shift toward more controlled, detailed prompting (c2) — is plausible and well-powered, but whose headline causal chain (c4, c6, c7) and welfare reversal (c8, c9) outrun an observational, single-platform design that names no identification strategy and operationalizes none of its load-bearing constructs. The flaws are about causal identification, construct validity, selection/survivorship, and single-platform reach — matters of scholarship, not integrity. The right move is to reframe the causal and welfare verbs as a hypothesized, partially evidenced mechanism and to bound the conclusion to the studied setting. Because access is abstract-only, these remain unresolved concerns rather than confirmed defects, which caps the verdict at: moderate.
needs review · Management, IS & marketing · severity moderate · abstract only
Critique of “Artificial intelligence adoption and the demand for managerial expertise”
Liudmila Alekseeva, José Azar, Mireia Giné et al. · Strategic Management Journal · 2026-05-06
A disciplined, transparently associational and well-bounded descriptive contribution that opens a useful and underexplored angle — AI adoption's link to managerial demand and managerial skill composition — and earns credit for hedged 'relates to / associated with / relationships / results suggest' language it never sharpens into causation. On the critic's reading, its central reported relationship and its 'reconfiguration' interpretation nonetheless appear to rest on a shared Lightcast measurement substrate (a posting-intensity confound the abstract does not rule out, inferred from the abstract's own description of both measures), on proxies that capture advertised intent rather than realized adoption or hiring, and on a skill-category taxonomy the abstract does not externally validate yet which carries the theoretical weight. These are measurement, construct-validity, and interpretation concerns rather than fatal flaws, and all are addressable within an associational design; severity: moderate.
✓ calibrated · Management, IS & marketing · severity moderate · abstract only
Critique of “When Influencers Delegate Replies: How Social AI Agents Shape User Engagement”
Maggie Mengqing Zhang, Yang Gao, Jingjing Li et al. · Information Systems Research · 2026-05-08
A credibly identified empirical contribution that engages a timely question with a real rollout and a defensible staggered-DiD design, and that scopes its claims with commendable hedging. Its weaknesses are concentrated and addressable rather than fatal: on the reading that reply receipt is conditional on a user's own commenting (the abstract does not specify the trigger), selection into who 'receives an AI reply' is distinct from the exogenous rollout; the abstract shows no pre-trend evidence and names no heterogeneity-robust estimator; the social-presence mechanism is labeled and inferred rather than measured; engagement stands in for an unmeasured relationship construct; and reach is limited to a single named platform. None of these impugns the design's core legitimacy, but together they bound what the abstract can presently support. Severity: moderate.
needs review · Economics & finance · severity low · abstract only
Critique of “Artificial Collusion: Examining Supracompetitive Pricing by Q-Learning Algorithms”
Arnoud den Boer, Janusz M Meylahn, Maarten Pieter Schinkel · Management Science · 2026-06-09
A valuable, carefully argued correction to an over-strong prior on algorithmic collusion. The caution, visible from the abstract, is that the general policy reassurance outruns the Q-learning-specific analysis and sits awkwardly beside the paper's own hedge. Severity low; the concern is the breadth of the policy inference, not the technical analysis.
✓ calibrated · Economics & finance · severity moderate · open access
Critique of “Generative AI at Work”
Erik Brynjolfsson, Danielle Li, Lindsey R. Raymond · The Quarterly Journal of Economics · 2025-02-04
A genuinely important, well-executed empirical study whose central productivity and heterogeneity findings are well supported and whose secondary mechanisms are appropriately hedged by the authors. The principal caveats are non-random rollout timing, reliance on LLM-derived secondary outcomes, and proprietary underlying data that cannot be independently audited. Severity moderate; publish.
needs review · Management, IS & marketing · severity low · abstract only
Critique of “Can ChatGPT Kill User-Generated Q&A Platforms?”
Junzhi Xue, Lizheng Wang, Jinyang Zheng et al. · Information Systems Research · 2026-05-21
A careful, quantified single-platform study whose own conclusion is suitably measured; the cautions, visible from the abstract, are the gap between the 'kill' framing and the coexistence finding, and a causal reading anchored to introduction timing. Severity low; the substantive claims are hedged and the concerns are about framing and external validity.
✓ calibrated · Management, IS & marketing · severity moderate · open access
Critique of “Scaffolding Human–AI Collaboration: A Field Experiment on Behavioral Protocols and Cognitive Reframing”
Alex Farach, Alexia Cambon, Lev Tankelevitch et al. · arXiv (working paper) · 2026-04-09
A transparent, well-reported field experiment on an important question whose causal claims are appropriately bounded by limitations the authors disclose: an AM/PM session confound, differential attrition, an LLM-graded length-sensitive outcome, no pre-registration, and a narrower set of belief-change effects surviving correction. These are identification, statistics and measurement cautions, openly stated. Severity moderate; the work is candid, and the signals are real but provisional.
✓ calibrated · Communication & media · severity low · abstract only
Critique of “Generative AI, propaganda, and digital authoritarianism: Comparative insights from six democratically weakened countries”
Gabrielle D. Beacken, Inga K Trauthig, Samuel Woolley · Big Data & Society · 2026-06-01
A strong, anti-determinist comparative study whose descriptive findings are well supported; the cautions, visible from the abstract, are the leap from elite-adoption interviews to causal claims about democratic erosion, and the reproducibility of interpretive thematic analysis. Severity low.
needs review · Management, IS & marketing · severity low · abstract only
Critique of “Made With AI: Consumer Engagement with Social Media Containing AI Disclosures”
Stephan Carney, Ignacio Riveros, Stephanie Tully · Journal of Consumer Research · 2026-05-05
A methodologically strong, policy-relevant study whose central effect is well supported; the cautions, visible from the abstract, are single-platform field evidence and the foregrounding of one mechanism, so the disclosure-design implications should stay close to what was tested. Severity low.
✓ calibrated · Economics & finance · severity moderate · open access
Critique of “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot”
Sida Peng, Eirini Kalliamvakou, Peter Cihon et al. · arXiv (working paper) · 2023-02-13
A cleanly identified RCT whose internal causal claim is well supported for its task; the cautions, all visible in the full text, are the imprecision of the headline estimate, the narrow single-task/freelancer scope (which the authors concede), the speed-not-quality outcome, and the lack of independent auditability given developer-run instrumentation. Severity moderate.
✓ calibrated · Sociology · severity low · abstract only
Critique of “Refusal as silence: Gendered disparities in Vision-Language Model responses”
Sha Luo, S Kim, Zening Duan et al. · New Media & Society · 2026-05-04
A well-designed identity audit with a striking, policy-relevant finding; the cautions, visible from the abstract, are reproducibility (a non-deterministic, version-dependent model with no stated run protocol) and single-model/single-task scope. Severity low.
✓ calibrated · Political science · severity low · abstract only
Critique of “The politics of artificial intelligence alignment: Public reactions to AI moderation in the case of Google’s Gemini”
Adrian Rauchfleisch, Andreas Jungherr · New Media & Society · 2026-06-01
A preregistered, well-theorised experiment whose main effect is credible for its primary stimulus; the cautions, visible from the abstract, are the reliance on pooling across a significant and a non-significant condition and the single-product scope. Severity low.
needs review · Psychology · severity moderate · abstract only
Critique of “Unraveling Generative AI from a Human Intelligence Perspective: A Battery of Experiments”
Wen Wang, Siqi Pei, Tianshu Sun · Information Systems Research · 2026-05-08
An ambitious, policy-facing evaluation framework whose central wording over-reads benchmark performance as 'intelligence' and whose forecasting claim outruns its experimental basis. These are construct-validity and generalisation concerns legible from the abstract; deeper methodological assessment would require the full text. Severity moderate.