Post-publication Comment · Critical AI
Comment on “More Versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review”
Critical AI · published 2026-06-21 · v1.0 · CRIT-GEN-more-versus-better-artif
Concerning: Claudine Madras Gartenberg, Sharique Hasan, Alex Murray, Lamar Pierce · Organization Science · 2026-04-27
Why this paper was selected
Selected via the production queue; critique generated by the AGISS engine.
AI/AGI centrality 2/5 · societal relevance 4/5 · source-journal note: Tier S per the determination; ingested from an AGISS critique artifact.
Summary
An Organization Science task force reports that submissions to their journal rose 42% after ChatGPT's late-2022 release while writing quality fell, attributes 'nearly all' of this to AI-generated writing, and finds AI-written peer reviews are lower-quality and less topically diverse than human ones. The descriptive signal is timely and the authors are admirably candid that they cannot yet judge 'ideal' AI usage. But the abstract never describes how text was classified as AI-generated, and the same classification underpins both the 'AI' label and the quality/diversity outcomes — so the AI-vs-human gaps could be partly built into the measurement. The 42% figure is a single-journal before/after that does not, in the abstract, separate ChatGPT from coincident trends, and the claim that the problem is field-wide rests on informal editor conversations. The 'equilibrium of more rather than better research' is a strong framing the data underdetermine. Verdict: a useful early alarm whose causal and systems-level wording outruns the abstract's stated evidence; the online appendix would be needed to verify the detection method.
Central claims & evidence map
| Claim | Type | Evidence offered | Support | Overclaiming | Main weakness |
|---|---|---|---|---|---|
| Submission volume at the journal has risen 42% since the late 2022 release of ChatGPT, while writing quality has declined. | | "Submission volume has risen 42% since the late 2022 release of ChatGPT, while writing quality has declined." | Weak | Moderate | A single-journal pre/post time series cannot, from the abstract, separate the ChatGPT release from coincident secular trends in submissions and reviewing practices. |
| The rise in AI-generated writing accounts for nearly all of these trends. | Causal | "The rise in AI-generated writing accounts for nearly all of these trends." | Weak | Major | Attribution to 'AI-generated writing' depends on an undescribed detection method whose accuracy and overlap with the quality measure are not reported, risking circularity. |
| AI-generated writing in reviews has increased and is characterized by lower writing quality and less topical diversity than human-generated writing. | | "AI-generated writing in reviews has also increased, and is characterized by lower writing quality and less topical diversity than human-generated writing." | Weak | Moderate | The AI-vs-human contrast is between classifier-defined groups, so detector error could generate the reported quality and diversity differences by construction. |
| This is, to the authors' knowledge, the first journal to report these early impacts of AI in the review process. | | "We are, to our knowledge, the first journal to report these early impacts of AI in the review process." | Moderate | Minor | A self-limited priority claim that, being first, lacks independent corroboration, as the authors' own hedging acknowledges. |
| Conversations with editors across scientific disciplines suggest that what is observed is not limited to this journal or to the social sciences. | | "Conversations with editors across scientific disciplines, however, suggest that what we observe is not limited to our journal or to the social sciences." | Weak | Moderate | Cross-disciplinary generality is supported only by informal editor conversations, an unsystematic and selection-prone source. |
| At this early stage of AI adoption, the authors cannot make a normative assessment about appropriate or ideal levels of AI usage. | Normative | "At this early stage of AI adoption, we cannot make a normative assessment about appropriate or ideal levels of AI usage." | Strong | None | A candid disclaimer that sits slightly in tension with the evaluative 'more rather than better' framing in the same abstract. |
| The current state of AI tools, amplified by existing publish-or-perish incentives, appears to be pushing the system toward an equilibrium of more rather than better research. | Theoretical | "appears to be pushing the system toward an equilibrium of more rather than better research" | Weak | Moderate | An 'equilibrium' systems claim and an incentive-amplification mechanism are asserted as framing without a model or evidence of the incentive channel in the abstract. |
Per-claim assessment
c1. Submission volume at the journal has risen 42% since the late 2022 release of ChatGPT, while writing quality has declined.
The abstract reports a precise figure (42%) and a co-occurring quality decline at a single journal, but gives no measurement detail: how 'writing quality' is operationalised, over what baseline window, or whether the comparison adjusts for secular submission growth that predates ChatGPT. On the critic's reading, anchoring the rise to 'since the late 2022 release of ChatGPT' frames a temporal coincidence as if it were attributable to ChatGPT, which the next sentence then asserts more strongly. Because this is a single-journal time series, the specific threat is that the post-2022 window also coincides with field-specific trends (a tenure-clock catch-up, conference cycles, journal scope changes) that the abstract does not rule out.
c2. The rise in AI-generated writing accounts for nearly all of these trends.
This is the strongest causal sentence in the abstract and the one most exposed. 'Accounts for nearly all' attributes both the 42% submission rise and the quality decline to AI-generated writing, yet AI-generated writing is itself detected (presumably by a classifier the abstract does not describe) rather than observed, so the attribution rests on a measurement whose accuracy and false-positive profile are unstated. On the critic's reading, the specific channel that makes this fragile is detector-driven confounding: if the same classifier flags both 'AI writing' and 'lower-quality writing' from overlapping textual features, then 'AI accounts for the quality decline' risks being partly definitional rather than an independent finding.
c3. AI-generated writing in reviews has increased and is characterized by lower writing quality and less topical diversity than human-generated writing.
This compares AI-classified reviews to human-classified reviews on quality and topical diversity. The comparison is between two groups defined by the same classifier, so any systematic misclassification (e.g., fluent-but-formulaic human reviews labelled AI) would mechanically produce the observed quality/diversity gap. The abstract gives no inter-rater or validation evidence for the AI/human split, and 'topical diversity' is left undefined. On the critic's reading, the direction of the likely bias is toward exaggerating the human-AI gap, because classifiers keyed on stylistic uniformity will both flag text as AI and score it as less diverse.
c4. This is, to the authors' knowledge, the first journal to report these early impacts of AI in the review process.
This is a modest, well-hedged priority claim ('to our knowledge', 'early'). It is largely unfalsifiable from the abstract but carries little inferential weight, and the authors appropriately limit it. Little to critique here beyond noting that a first-mover descriptive report is, by nature, not yet corroborated by independent journals.
c5. Conversations with editors across scientific disciplines suggest that what is observed is not limited to this journal or to the social sciences.
The generalisation beyond the focal journal rests on informal 'conversations with editors', which the abstract presents as suggestive rather than systematic. The sampling of editors, their number, and how their impressions were elicited are unstated; such conversations are vulnerable to selection (editors who already perceive an AI problem are likelier to engage) and to confirmation framing. On the critic's reading, this is appropriately hedged ('suggest', 'not limited to'), so the concern is scope-of-evidence rather than overclaim — but it should not be read as multi-journal quantitative corroboration.
c6. At this early stage of AI adoption, the authors cannot make a normative assessment about appropriate or ideal levels of AI usage.
This is an explicit, commendable self-limitation: the authors disclaim the normative conclusion their descriptive findings might invite. It is candid and reduces the burden of critique on the normative front. The only tension is that the subsequent 'more rather than better' framing and the 'critical engine of innovation' aspiration are themselves lightly normative in tone, even as the sentence disclaims normativity.
c7. The current state of AI tools, amplified by existing publish-or-perish incentives, appears to be pushing the system toward an equilibrium of more rather than better research.
This is the abstract's central interpretive thesis and it is hedged ('appears to be'). However, 'equilibrium' is a strong systems-level term invoked from journal-level descriptive data on submissions and reviews; the abstract presents no model or dynamic evidence that the system is at or converging to an equilibrium rather than in transient adjustment during 'early' adoption. The 'amplified by existing publish-or-perish incentives' clause adds a causal moderator (incentives) that the abstract does not show was measured or varied. On the critic's reading, this is best read as a framing conjecture consistent with the descriptive findings, not a demonstrated equilibrium result.
Scorecard
Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.
What the abstract claims
The piece is an editorial/conceptual account from the AI Task Force for Organization Science reporting early measured impacts of AI on a 'major academic journal'. Its descriptive core is concrete: 'Submission volume has risen 42% since the late 2022 release of ChatGPT, while writing quality has declined', AI-generated writing 'accounts for nearly all of these trends', and AI-generated review text has risen with 'lower writing quality and less topical diversity' than human text. Around this sit a priority claim (first journal to report such review-process impacts), a generalisation from 'conversations with editors', an explicit refusal to make a normative assessment, and an interpretive thesis that AI plus 'publish-or-perish incentives' is pushing toward an 'equilibrium of more rather than better research'. The genre is a descriptive-plus-framing editorial, and it should be judged as such: by transparency of measurement and the fit between data and the systems-level reading, not by causal-identification standards it never claims.
The load-bearing weakness: detector-defined variables
Nearly every empirical sentence depends on classifying text as 'AI-generated', yet the abstract never describes the detection method, its validation, or its error rates. This matters in one specific way rather than generally: the same classifier both defines the 'AI' group and is implicitly entangled with the 'quality' and 'topical diversity' outcomes. If the detector keys on stylistic uniformity, the features that get text labelled AI are the same ones that a diversity or quality measure penalises, so the finding that AI writing 'is characterized by lower writing quality and less topical diversity' risks being partly definitional. The strong causal sentence, that AI 'accounts for nearly all of these trends', inherits this exposure: it attributes an observed quality decline to a latent, detector-estimated cause. On the critic's reading the likely bias is toward exaggerating the AI–human gap; the online appendix may resolve this, but the abstract alone does not.
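The circularity concern can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not describe the detector or the quality measure, so the detector rule, the quality score, the noise levels, and the 0.6 threshold below are all hypothetical. True authorship is drawn independently of style, so any quality gap between detector-defined groups comes from the measurement alone.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Generate (truly_ai, flagged, quality) rows for a toy corpus."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # True authorship, drawn independently of writing style (assumption).
        truly_ai = rng.random() < 0.5
        # Latent stylistic uniformity in [0, 1] (hypothetical feature).
        uniformity = rng.random()
        # Hypothetical detector: flags text whose noisy uniformity exceeds 0.6.
        flagged = uniformity + rng.gauss(0, 0.1) > 0.6
        # Hypothetical quality score that penalises the same uniformity.
        quality = 1.0 - uniformity + rng.gauss(0, 0.1)
        rows.append((truly_ai, flagged, quality))
    return rows


def quality_gap(rows, label_index):
    """Mean quality of the labelled group minus mean quality of the rest."""
    labelled = [r[2] for r in rows if r[label_index]]
    rest = [r[2] for r in rows if not r[label_index]]
    return mean(labelled) - mean(rest)


rows = simulate()
true_gap = quality_gap(rows, 0)      # groups defined by true authorship
detector_gap = quality_gap(rows, 1)  # groups defined by the detector
print(f"gap by true authorship: {true_gap:+.3f}")
print(f"gap by detector label:  {detector_gap:+.3f}")
```

In this toy setup the gap by true authorship is close to zero, while the detector-defined gap is large and negative, because the detector and the quality score share a feature. Whether anything similar holds for the paper depends on the detection method documented in its appendix.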
Single journal, secular trends, and the 42%
The headline 42% rise is a single-journal pre/post comparison anchored to 'the late 2022 release of ChatGPT'. The specific identification threat is not generic confounding but the coincidence of the post-2022 window with field-level trends the abstract does not rule out — submission growth that predates ChatGPT, scope or policy changes at the journal, or post-pandemic catch-up. Anchoring the rise to ChatGPT's release frames a temporal coincidence as attribution, which the following sentence then hardens. The cross-disciplinary generalisation rests on 'conversations with editors', an informal and selection-prone source (editors already perceiving an AI problem are likelier to engage). Both points are about scope of evidence; the authors hedge the generalisation ('suggest'), and the single-journal design is a legitimate descriptive base — but not a basis for system-wide inference.
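The difference between a raw before/after rise and a trend-adjusted one can be made concrete with compound-growth arithmetic. The sketch below is illustrative only: the 42% figure is from the abstract, while the three-year window and the 5% pre-existing annual growth rate are assumptions chosen for the example, not values the authors report.

```python
def implied_annual_growth(total_rise: float, years: float) -> float:
    """Constant annual growth rate that alone would produce total_rise over years."""
    return (1.0 + total_rise) ** (1.0 / years) - 1.0


def excess_over_trend(total_rise: float, trend_rate: float, years: float) -> float:
    """Rise left over after dividing out a pre-existing compound trend."""
    return (1.0 + total_rise) / (1.0 + trend_rate) ** years - 1.0


TOTAL_RISE = 0.42    # from the abstract
WINDOW_YEARS = 3.0   # assumption: the abstract does not state the window length
PRIOR_TREND = 0.05   # assumption: hypothetical pre-2022 annual submission growth

implied = implied_annual_growth(TOTAL_RISE, WINDOW_YEARS)
excess = excess_over_trend(TOTAL_RISE, PRIOR_TREND, WINDOW_YEARS)
print(f"annual growth that alone yields 42%: {implied:.1%}")
print(f"rise remaining after a 5% prior trend: {excess:.1%}")
```

Under these assumed inputs, steady growth of about 12% a year would by itself produce a 42% rise, and a 5% prior trend would leave roughly 23% to be explained by anything else. The point is only that the headline figure and the attributable figure can differ substantially; the actual baseline would have to come from the journal's own pre-2022 series.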
Framing versus disclaimer, and what is fair
The abstract is candid where it should be: it states plainly that 'we cannot make a normative assessment about appropriate or ideal levels of AI usage', which is creditable and limits the critique on the normative axis. The tension is that the evaluative thesis — an 'equilibrium of more rather than better research' — is a strong systems-level and lightly normative framing drawn from journal-level descriptive data, with no model and no demonstrated measurement of the 'publish-or-perish incentives' it names as the amplifier. Read charitably and with its own hedge ('appears to be'), this is a framing conjecture consistent with the data rather than a demonstrated equilibrium. The fair reading is that this is an early, transparent descriptive alarm whose interpretive overlay outruns the abstract's stated evidence, and whose key measurement (AI detection) needs the appendix to be auditable.
Strongest critique
The empirical backbone rests on classifying text as 'AI-generated' by a method the abstract never describes or validates, and that same classification is entangled with the very outcomes it is compared against — so the findings that AI writing has 'lower writing quality and less topical diversity' and that AI 'accounts for nearly all of these trends' risk being partly definitional rather than independent. Layered on a single-journal pre/post 42% rise that the abstract does not separate from coincident post-2022 trends, and a cross-discipline generalisation resting on informal editor conversations, the descriptive claims appear (on the critic's reading) biased toward overstating the AI–human gap, and the 'equilibrium of more rather than better research' reads as a framing conjecture the abstract's evidence underdetermines.
Strongest fair defence
As an early editorial from a journal's own task force, the piece does exactly what its genre licenses: it puts concrete, dated numbers on a phenomenon many suspect but few have measured, and it is unusually disciplined about its limits — explicitly declining a normative verdict, hedging its central thesis ('appears to be'), flagging its generalisation as merely 'suggested' by editor conversations, and pointing to an online appendix that presumably documents the detection method the abstract omits for space. The 42% rise and the human-versus-AI review contrasts are legitimate descriptive observations from privileged internal data no outside researcher could easily obtain, and naming 'publish-or-perish incentives' as an amplifier is a reasonable interpretive frame rather than a claimed causal estimate. Judged as a first-mover descriptive alarm rather than an identification study, it is candid and valuable.
Conclusion
A timely, candidly hedged descriptive editorial whose value is the dated internal evidence it surfaces, but whose causal sentence ('accounts for nearly all of these trends') and systems-level framing ('equilibrium of more rather than better research') outrun what the abstract's stated evidence supports. The load-bearing weakness is that 'AI-generated' is a detector-defined variable whose method is undescribed and entangled with the quality/diversity outcomes it is compared against, so on the critic's reading the AI–human gaps are plausibly overstated; the 42% rise is a single-journal pre/post not separated from coincident trends, and cross-field generality rests on informal editor conversations. Severity capped at moderate (abstract-only); the online appendix would be needed to verify the detection method and the attribution.
Reply from the authors
Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.
Reply: not yet invited. No reply has been received for publication.
The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.
Source-grounding attestation
- ✓ Verbatim source spans present in the critique: 6/6 provenance spans re-derived in the critique prose
- ✓ Passes the publication validator: no errors
- ✓ Zero fabricated citations: 0 fabricated
- ✓ Severity within the access-basis cap: severity "moderate" ≤ cap "moderate" for abstract_only
Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).
Re-verify span-in-source offline: python3 scripts/verify-queue-critiques.py
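Two of the checks listed above, span-in-source and the severity cap, are simple enough to sketch. The following is a hypothetical illustration, not the contents of scripts/verify-queue-critiques.py: the function names, the whitespace and quote normalisation, and the four-level severity ordering are assumptions. Only the abstract_only cap of "moderate" and the quoted abstract sentences come from this page.

```python
import re

# Assumed ordinal scale; the page does not publish the full ordering.
SEVERITY_ORDER = ["none", "minor", "moderate", "major"]
# Only this cap is stated on the page; other access bases are not listed here.
ACCESS_CAPS = {"abstract_only": "moderate"}


def normalise(text: str) -> str:
    """Unify curly single quotes and collapse whitespace before comparing."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def spans_in_source(spans, source):
    """Return one boolean per span: True if it occurs verbatim in the source."""
    normalised_source = normalise(source)
    return [normalise(span) in normalised_source for span in spans]


def within_cap(severity: str, access_basis: str) -> bool:
    """True if severity does not exceed the cap for the given access basis."""
    cap = ACCESS_CAPS[access_basis]
    return SEVERITY_ORDER.index(severity) <= SEVERITY_ORDER.index(cap)


# Source text assembled from two abstract sentences quoted in this Comment.
source = (
    "Submission volume has risen 42% since the late 2022 release of ChatGPT, "
    "while writing quality has declined. The rise in AI-generated writing "
    "accounts for nearly all of these trends."
)
spans = [
    "Submission volume has risen 42%",
    "accounts for nearly  all of these trends",  # extra space is normalised away
    "a phrase that is not in the abstract",
]
results = spans_in_source(spans, source)
cap_ok = within_cap("moderate", "abstract_only")
print(results, cap_ok)
```

A substring test of this kind confirms that a quotation exists in the source, not that it is read fairly, which is why the attestation is paired with the separate faithfulness review.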
Independent faithfulness review
A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.
All seven critique claims restate the abstract accurately and quote it verbatim. The critique consistently flags its key inference — that "AI-generated" writing is a detector-defined variable whose method is undescribed — as a presumption ("presumably by a classifier the abstract does not describe") rather than asserting the abstract said so. Every interpretive/causal concern (temporal-coincidence framing in c1, detector-driven confounding in c2/c3, equilibrium-as-conjecture in c7) is explicitly hedged with "on the critic's reading." c2's reading that "accounts for nearly all of these trends" covers both the 42% rise and the quality decline is well-supported by the antecedent sentence. c4, c5, and c6 are charitable and even credit the authors' hedging and self-limitation. No claim asserts beyond the abstract or strengthens/narrows what the paper says; the strongest-critique and final-judgment paragraphs likewise cap severity at moderate given abstract-only access and route verification to the appendix. No substantiated overreach or mischaracterization.
Version & correction history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-06-21 | |
No silent substantive corrections — every change is versioned and visible.
How to cite this Comment
Critical AI. Comment on “More Versus Better: Artificial Intelligence, Incentives, and the Emerging Crisis in Peer Review” (Claudine Madras Gartenberg et al., Organization Science, 2026). Critical AI; 2026. https://policywindow.org/critique/c/more-versus-better-artificial-intelligence-incenti
A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.
Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/more-versus-better-artificial-intelligence-incenti/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique more-versus-better-artificial-intelligence-incenti --live.
Content fingerprint 9a1379bd8ebe27af (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.
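The idea behind a content-addressed fingerprint can be sketched as a truncated hash of canonicalised text. This is an assumption-laden illustration: the page does not state which hash function, canonicalisation, or truncation length it uses, so the SHA-256 choice and the whitespace rule below are hypothetical and are not expected to reproduce the value shown above.

```python
import hashlib


def fingerprint(text: str, length: int = 16) -> str:
    """Hash whitespace-canonicalised text and keep the first `length` hex digits."""
    canonical = " ".join(text.split())  # assumed canonicalisation rule
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


original = "Severity capped at moderate (abstract-only)."
reflowed = "Severity  capped at\nmoderate (abstract-only)."
edited = "Severity capped at major (abstract-only)."

same_after_reflow = fingerprint(original) == fingerprint(reflowed)
changed_after_edit = fingerprint(original) != fingerprint(edited)
print(fingerprint(original), same_after_reflow, changed_after_edit)
```

Under a scheme like this, reflowing whitespace leaves the fingerprint unchanged while any substantive wording change produces a different value, which is the property the page relies on to make silent edits detectable.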