{"$schema":"https://policywindow.org/critique/api/schema","name":"Critical AI — presentation & referencing benchmark","description":"How Critical AI's critiques READ and CITE, benchmarked against the standard real top-journal Comments set (compact flowing narrative, dense in-text citations, 6-15-work reference list, calibrated register). A refute-by-default panel scored six format dimensions per critique. Honest result: at-standard on compactness + hedging, strong on register, below on supporting-reference density and the AI-native titled-section structure.","docs":"https://policywindow.org/critique/presentation","run_date":"2026-06-22","cohort":5,"overall_bands":{"at_standard":0,"approaching":0,"below":5,"well_below":0},"dimensions":[{"name":"narrative_coherence","counts":{"at_standard":0,"approaching":3,"below":2,"well_below":0},"meanScore":1.6,"verdict":"mixed"},{"name":"academic_register","counts":{"at_standard":2,"approaching":3,"below":0,"well_below":0},"meanScore":2.4,"verdict":"strength"},{"name":"compactness","counts":{"at_standard":5,"approaching":0,"below":0,"well_below":0},"meanScore":3,"verdict":"at_standard"},{"name":"hedging_calibration","counts":{"at_standard":5,"approaching":0,"below":0,"well_below":0},"meanScore":3,"verdict":"at_standard"},{"name":"supporting_references","counts":{"at_standard":0,"approaching":0,"below":1,"well_below":4},"meanScore":0.2,"verdict":"gap"},{"name":"citation_grounding","counts":{"at_standard":0,"approaching":1,"below":3,"well_below":1},"meanScore":1,"verdict":"gap"}],"strengths":["academic_register","compactness","hedging_calibration"],"gaps":["supporting_references","citation_grounding"],"records":[{"critiqueId":"brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","refs":4,"overallBand":"below","distinguishingTell":"Sparse, decorative referencing: a Comment-length piece that name-drops a half-dozen estimators and AI-model-derived metrics yet carries only 4 references (two named estimators missing from the list), 
essentially zero author-date in-text citations, and no DOIs — the opposite of the dense 1-3-cites-per-paragraph, 6-15-work grounding that defines a real top-journal Comment.","dimensions":[{"name":"narrative_coherence","band":"below","evidence":"The piece reads as a clean, logically ordered appraisal, but its eight labeled sections ('What the paper does','The headline productivity result','Who benefits','Mechanisms...','Language and convergence effects','Experience of work','Limits and external validity','Overall appraisal') reproduce a structured review/referee-report template, not the headerless flowing narrative typical of a QJE/AER Comment or a Nature Matters Arising. Real Comments rarely walk through every subsection of the target paper; they open with the contested claim and build a single argumentative arc. The section-by-section recapitulation ('What the paper does' before any critique) is a coherent but distinctly non-Comment shape."},{"name":"academic_register","band":"approaching","evidence":"Register is genuinely formal and field-appropriate: 'Identification leans on individual-level differences in adoption timing', 'survives Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna and Borusyak et al. estimators', 'mechanical mean reversion', 'skill-biased-technical-change literature'. Tight and unpadded. Lands just short of at_standard because the recap framing ('This is well identified for the setting') and tidy summary verdicts read slightly like an internal evaluator's scorecard rather than the adversarial scholarly voice of a published Comment."},{"name":"compactness","band":"at_standard","evidence":"At ~1150 words the length is appropriate for a Comment, and the prose is dense and economical: specific numbers carried inline (0.30, 15.2%, mean 1.97, Table II col.3, +0.5 RPH/36%, lowest quintile), estimator names listed without elaboration, no pedagogical padding. 
The STRONGEST CRITIQUE paragraph packs multiple distinct objections (low-powered outage variation, LLM-manufactured outcomes, non-random timing, proprietary data) into one tight sentence. This axis is indistinguishable from a real Comment."},{"name":"hedging_calibration","band":"at_standard","evidence":"Hedging is well calibrated and proportionate: 'largely rule out mechanical mean reversion', 'less robust than their prominence in policy debate implies', 'Severity moderate; publish', 'appropriately hedged by the authors'. Claims are graded rather than absolute, and the verdict distinguishes well-supported headline results from weaker secondary mechanisms. This matches the measured evaluative tone of a top-journal Reply."},{"name":"supporting_references","band":"well_below","evidence":"Only 4 references against the 6-15 standard for a Comment, and critically the in-text density is far below norm: the body names many methods and literatures (Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Borusyak et al., SiEBERT, Gemini scoring, skill-biased technical change, IV using office adoption dates) but supplies almost no author-date in-text citations grounding them. de Chaisemartin-D'Haultfoeuille and Borusyak et al. are invoked but absent from the reference list; SiEBERT/Gemini/SBTC are uncited. No DOIs/links. Density is roughly zero cites per paragraph versus the 1-3 standard."},{"name":"citation_grounding","band":"below","evidence":"The 4 listed works are individually on-point as concepts (Callaway-Sant'Anna and Sun-Abraham are exactly the DiD estimators named; Galton 1886 grounds 'mean reversion'; Polanyi 1966 grounds the tacit-knowledge/learning angle), so they are not hallucinated or off-topic. But grounding is shallow and uneven: the reference list omits two estimators it explicitly relies on (de Chaisemartin-D'Haultfoeuille, Borusyak et al.) 
while including a 19th-century and a mid-20th-century classic as the only non-econometrics support, and none of the load-bearing critique points (LLM-derived outcome validity, sentiment/embedding bias) are backed by any cited methodological literature. Citations decorate rather than support the argument."}]},{"critiqueId":"the-cybernetic-teammate-a-field-experiment-on-gene","refs":2,"overallBand":"below","distinguishingTell":"Near-absent referencing apparatus: only two works, zero in-text author-date citations, and a garbled Campbell & Stanley entry — versus the dense 6-15-reference, citation-per-paragraph fabric of a real top-journal Comment.","dimensions":[{"name":"narrative_coherence","band":"approaching","evidence":"The argument flows logically (design strength -> external-validity overreach -> self-report weakness) and the prose is clean. But it is delivered in three labelled sections ('What the paper does', 'Where the claim outruns the design', 'Self-reported social effects'), whereas real Comments typically run as continuous unheaded prose. The headers signal a templated structure rather than the seamless narrative of a published Reply."},{"name":"academic_register","band":"at_standard","evidence":"Register is formal and evaluative without being pedagogical: 'an external-validity step the single-site design does not support', 'vulnerable to novelty and demand effects'. Diction matches a sophisticated reader and avoids padding or tutorial tone."},{"name":"compactness","band":"at_standard","evidence":"At ~299 words with tight, claim-dense sentences and no filler, it is appropriately compact for the short-Comment form. 
Each sentence carries an evaluative point; nothing is restated."},{"name":"hedging_calibration","band":"at_standard","evidence":"Hedging is well-calibrated: it concedes the design strength explicitly ('a genuine methodological strength'), scopes the critique ('Our caution is narrower'), and grades severity ('the social reading is weaker than the performance result', 'Severity low'). Neither overclaiming nor mealy."},{"name":"supporting_references","band":"well_below","evidence":"Only 2 references total, against the 6-15 standard for a Comment, and there are zero in-text author-date citations in the section bodies. Methodological claims (external validity, demand effects, self-report bias) are asserted without any cited literature; the two references sit in a list but never appear to anchor a point in the prose. Density is far below the 1-3 citations per paragraph norm."},{"name":"citation_grounding","band":"below","evidence":"The two works are thematically plausible (Deaton & Cartwright 2018 on RCT generalisability; Campbell & Stanley on internal/external validity) and could in principle ground the external-validity point. But neither is tied to a specific claim in text, the Campbell & Stanley entry is mis-cited as an 'Encyclopedia of Educational Theory and Philosophy' article (it is the 1963 monograph), and nothing supports the demand/novelty-effect or self-report claims. 
Provenance is shaky and no DOIs/links are given."}]},{"critiqueId":"artificial-collusion-examining-supracompetitive-pr","refs":4,"overallBand":"below","distinguishingTell":"Sparse, detached referencing: only 4 works (below the 6-15 norm) with virtually no dense in-text author-date citations woven into the argument's methodological claims — a real Comment grounds nearly every paragraph with 1-3 in-text cites, whereas here the reference list floats apart from the prose.","dimensions":[{"name":"narrative_coherence","band":"approaching","evidence":"The argument has a clear through-line (paper deflates one alarm, then over-generalizes to policy) and the two-section logic is coherent. But real Comments typically run as a single flowing narrative without scaffolding headers like 'What the paper does' / 'Algorithm-specific result, general-sounding conclusion'; the explicit summarize-then-critique sectioning reads more like a structured report than a top-journal Comment's compact prose."},{"name":"academic_register","band":"approaching","evidence":"Register is largely formal and evaluative ('in tension with', 'held to the algorithm class actually studied'). However, the plain-language summary leaks evaluative-but-soft phrasing ('a scary idea', 'look under the hood', 'the debunking is valuable', 'Our one caution') that is closer to editorial/blog tone than the impersonal, terse register of a QJE/Nature Matters Arising Comment."},{"name":"compactness","band":"at_standard","evidence":"At ~258 words the piece is tight and non-padded; the core sections are dense and assume a sophisticated reader who knows Q-learning and the collusion literature. No pedagogical bloat in the technical sections. 
Length and economy are consistent with a short Comment."},{"name":"hedging_calibration","band":"at_standard","evidence":"Hedging is well-calibrated to a single modest claim: it credits the paper ('rightly deflates', 'careful and valuable'), confines the criticism to scope ('breadth of the policy inference, not the technical analysis'), and rates 'Severity low'. This proportionate, non-overclaiming stance matches the measured tone of a real Reply/Comment."},{"name":"supporting_references","band":"below","evidence":"Only 4 references against the 6-15 standard for a Comment, and they are not deployed as dense in-text author-date citations (1-3 per paragraph). The critique's actual prose carries essentially no in-text citations grounding its methodological points; the reference list sits detached from the argument rather than woven through it, which is the opposite of the citation-dense Comment norm."},{"name":"citation_grounding","band":"approaching","evidence":"The four works are on-topic and plausibly the right anchors (Watkins & Dayan 1992 for Q-learning; Klein 2021 and the OECD 2017 report for autonomous algorithmic collusion; Harrington 2018 for competition-law framing). 
They ground the domain, but none is cited at the specific point of the critique's central methodological claim (that conditions are 'irrelevant timescales' / require 'implausible synchronisation'), so on-pointness to the actual argument is partial."}]},{"critiqueId":"peng-copilot-developer-productivity","refs":1,"overallBand":"below","distinguishingTell":"The single textbook reference with no in-text author-date citations anywhere in the body — a real Comment would carry 6-15 targeted works cited densely throughout, whereas this asserts every methodological point uncited and bundles one mislabelled textbook into a reference list.","dimensions":[{"name":"narrative_coherence","band":"below","evidence":"The critique is broken into three labelled sections (\"What the paper does\", \"Precision and scope\", \"Quality and auditability\"), whereas top-journal Comments typically run as a single compact flowing narrative with no headers. The prose within sections is clean and logically ordered, but the rubric-style sectioning and the staged 'STRONGEST CRITIQUE / FINAL JUDGMENT' scaffolding read as a structured template rather than a continuous argumentative Comment."},{"name":"academic_register","band":"approaching","evidence":"Register is formal and evaluative ('the causal claim credible for this task', 'not independently auditable', 'far more uncertain and bounded than its ubiquitous citation suggests'), close to Comment tone. But phrasing like 'the much-quoted claim' and 'its near-universal citation' leans slightly toward science-journalism framing rather than the dry technical register of a Technical Comment."},{"name":"compactness","band":"at_standard","evidence":"At ~328 words it is tight and non-padded; every clause carries an evaluative point (CI width, single-task scope, speed-not-quality, developer-run instrumentation). No pedagogical filler. 
Length and density are consistent with a short journal Comment."},{"name":"hedging_calibration","band":"at_standard","evidence":"Hedging is well-calibrated: it credits the RCT identification ('makes the causal claim credible for this task'), bounds claims appropriately ('a time saving is not yet a demonstrated productivity gain'), and assigns 'Severity moderate' rather than overstating. Claims are neither overconfident nor mushy."},{"name":"supporting_references","band":"well_below","evidence":"Only ONE reference (Shadish, Cook & Campbell 2002) against the 6-15 standard for a Comment, and zero in-text author-date citations in the section bodies. Methodological points that demand grounding — RCT external validity, the speed-vs-quality outcome distinction, replication-package/auditability norms, CI interpretation — are asserted without any cited literature."},{"name":"citation_grounding","band":"below","evidence":"The single reference is a real, on-point classic for the generalizability/external-validity point in 'Precision and scope', so it is not fabricated or off-topic. 
But it is a textbook citation rather than the targeted literature a Comment would marshal, it carries an incorrect source field (a book attributed to 'Journal of the American Statistical Association'), it is never anchored to a specific claim in-text, and it does nothing to ground the auditability or speed-vs-quality arguments."}]},{"critiqueId":"the-politics-of-artificial-intelligence-alignment","refs":0,"overallBand":"below","distinguishingTell":"Zero in-text citations and an empty reference list — a real top-journal Comment would ground its pooling/external-validity critique in 6-15 cited works with 1-3 author-date citations per paragraph; this has none.","dimensions":[{"name":"narrative_coherence","band":"approaching","evidence":"The two micro-sections read cleanly and the argument flows logically from setup to limitation, but real Comments are continuous narrative with no headers; the 'What the paper does' / 'The pooled result' header scaffold plus the truncated abstract ('rests o') signals a templated mini-report rather than a self-standing flowing Comment."},{"name":"academic_register","band":"at_standard","evidence":"Register is formal, evaluative, and precise ('the headline rest on the pooled estimate rather than on a result that replicated across both stimuli'), appropriately conceding the authors flag the limitation themselves. Tone matches a sophisticated-reader Comment."},{"name":"compactness","band":"at_standard","evidence":"At ~234 words it is tight, non-pedagogical, and dense with content; no padding. Compactness is genuinely Comment-like."},{"name":"hedging_calibration","band":"at_standard","evidence":"Hedging is well-calibrated: 'a real but appropriately-hedged limitation the authors themselves flag', 'supported more narrowly than it first reads', and 'Severity low' avoid overclaiming while still registering the concern."},{"name":"supporting_references","band":"well_below","evidence":"Zero supporting references against the 6-15 standard. 
Real Comments ground every methodological point (pooling significant+nonsignificant conditions, underpowered-replication, external-validity from a single product) with 1-3 citations per paragraph; here there are none and no reference list."},{"name":"citation_grounding","band":"well_below","evidence":"No citations exist at all, so none can be on-point. Methodological claims about pooling and generalizability are asserted without anchoring to any methods or domain literature."}]}],"parity_proof":{"runDate":"2026-06-22","method":"blind 2-condition discrimination; cleaned specimens (front-matter/OCR confound removed after a caught first run); single high-effort judge per specimen per condition; 3 real-Comment + 5 engine prose specimens","headline":"Judges distinguished engine critiques from real Comments at 8/8 (100%) in BOTH conditions — even on prose register alone. The engine is NOT yet at prose parity. Condition B (with the reference apparatus) raised confidence (6/8 high vs 4/8) but was not needed.","caveats":"n=8 (3 real / 5 engine), single judge per cell, base-rate-imbalanced; a first run was invalidated by front-matter/OCR artifacts in the real specimens (caught + corrected); real specimens retain minor OCR noise (judges instructed to ignore; cited tells are register/structure, not OCR). Conditions A and B judge the SAME specimens, so they are NOT 16 independent trials: condition A (prose-register-only, 8/8, exact-binomial p=0.0039, Wilson-95%-lower-bound 0.68 > the 0.6 parity threshold) is the load-bearing result; condition B only raises confidence, not sample size. 
The NOT-at-parity verdict is robust; the 5 enumerated register tells are single-rater hypotheses (the existence of a detectable gap is the robust claim, not each specific tell).","conditions":[{"key":"A_prose_register","label":"Prose register only","n":8,"correct":8,"accuracy":1,"highConfN":4,"highConfCorrect":4},{"key":"B_prose_plus_refs","label":"Prose + reference-list size","n":8,"correct":8,"accuracy":1,"highConfN":6,"highConfCorrect":6}],"engineRegisterTells":[{"id":"engine-brynjolfsson-li-raymond-","confidence":"medium","tell":"The exhaustive, comprehensive name-dropping of all four modern staggered-DiD estimators in one breath (\"Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna and Borusyak et al.\"), combined with the sprawling opening sentence that bundles every conceivable objection with parenthetical triple-enumerations (\"Gemini fluency scores, SiEBERT sentiment, embedding-based convergence\"), reads like AI completeness rather than a real Comment's narrower, single-threaded focus."},{"id":"engine-the-cybernetic-teammate-","confidence":"high","tell":"The signature AI-critique rhetorical move \"the most quotable conclusions are the least supported by the design's actual scope\" — a polished, generalized antithesis framing that prizes aphoristic balance over the concrete, citation-anchored point a real journal Comment would make."},{"id":"engine-artificial-collusion-exa","confidence":"medium","tell":"The argument is built on a tidy, self-contained \"gotcha\" structure — quoting the paper's own abstract hedge ('remains to be seen') back against its policy claim and framing a clean scope-mismatch (Q-learning vs. 'pricing algorithms' in general). This symmetrical \"rightly deflates X but replaces it with over-strong Y\" rhetorical scaffolding, with the internal-contradiction reveal neatly packaged in a few sentences, is the signature move of an AI-generated critique. 
A real published Comment typically grounds the objection in external evidence or methodology rather than purely re-reading the target's own wording for tension."},{"id":"engine-peng-copilot-developer-p","confidence":"high","tell":"The four-noun parallel summative verdict (\"its precision, generality, completeness, and auditability are all weaker than...\") is a signature LLM construction — an over-engineered, schematized list of abstract evaluative dimensions that maps the critique onto a tidy taxonomy, paired with the conceded-redundancy of restating the \"(21-89%)\" interval and task/language points twice in adjacent sentences, which reads as machine-generated recap rather than the economical, forward-moving argument of a real published Comment."},{"id":"engine-the-politics-of-artifici","confidence":"medium","tell":"The passage restates the same point twice in near-identical terms (\"significant for one image set but not the other... carried by the pooled estimate\" then \"first image set produced significant effects; the second did not reach significance... by pooling\"), a redundant recap-and-rephrase rhythm typical of LLM exposition rather than the economical, forward-moving prose of a real published Comment."}],"atParity":false},"parity_proof_v2":{"runDate":"2026-06-22","method":"applied the G67-validated revision pass to the ACTUAL published critiques' prose (the same 5 engine specimens as the G65 parity proof), then blind-judged baseline-engine + revised-engine + real-Comment prose (register only) in one run","baselineDiscrimination":1,"revisedDiscrimination":1,"baselineEngineCaught":5,"revisedEngineCaught":5,"engineN":5,"reachesParity":false,"headline":"The revision pass did NOT close the gap on real critiques: blind discrimination stayed 100% (5/5 revised-engine still identified as engine). 
G67's tell-count metric is a GAMEABLE PROXY — reducing the 5 enumerated tells does not reduce discriminability.","deeperGap":"Judges identified DEEPER structural register on the REVISED prose: over-symmetry, schematic diagnostic arcs, parallel tricolons, restating a point multiple ways, and — notably — the revision's 'ground in external literature' instruction produced DECORATIVE name-drops (a NEW tell). Presentation parity is harder than tell-targeted revision; the engine's prose carries a structural LLM fingerprint that survives it.","residualTells":[{"id":"engine-brynjolfsson-li-raymond-","confidence":"medium","tell":"The mechanical exhaustiveness of the robustness recital — naming all four DiD estimators in a tidy series (Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna, Borusyak et al.) plus IV, alongside a perfectly stacked list of four fixed effects and a near-checklist of caveats (rare events, AI-generated outcomes, non-random timing, proprietary data) — reads like comprehensive coverage-by-enumeration rather than the selective, prioritized emphasis a human referee typically deploys in a compact Comment."},{"id":"engine-the-cybernetic-teammate-","confidence":"medium","tell":"The referencing is name-dropped as schematic illustration rather than precise attribution (\"Banerjee and Duflo on the limits of one-site trials; List's work on what makes effects generalize\") — invoking authorities as decorative parentheticals without specific claims or works, a hallmark of engine-generated authority-signaling. 
Combined with the tidy diagnostic arc (gap-naming, methodological label, then a balanced concession-style fix in the closing sentence), the register reads as the over-symmetrical, self-contained reasoning of an AI critique rather than the more idiosyncratic, terse compression of a real published Comment."},{"id":"engine-artificial-collusion-exa","confidence":"medium","tell":"The schematic \"gap\" rhetoric built on tidy parallel tricolons (\"the learning rule, the action space, and the market environment\"; \"a single algorithm class\") plus the self-consciously balanced closer (\"the kind of extrapolation the paper's own design does not license\") reads as the over-symmetrical, frictionless argumentative cadence typical of LLM critique rather than the looser, more idiosyncratic phrasing of a published Comment."},{"id":"engine-peng-copilot-developer-p","confidence":"high","tell":"The over-engineered scaffolding of a single point: a CI restated three ways (\"21% to 89%\", \"spread so wide\", \"marginal gain or near-doubling\"), the schematic \"is not what X is taken to mean\" framing, and a tidy closing antithesis (\"promising single estimate, not a benchmark\"). This explain-the-statistic-to-the-reader didactic register, plus generic name-dropped scaffolding (SPACE framework, \"replication package\") deployed as balanced caveats rather than pointed argument, reads as AI critique rather than the compressed, assumption-laden prose of a real top-journal Comment."},{"id":"engine-the-politics-of-artifici","confidence":"medium","tell":"The argument is built on a too-neat, self-balancing rhetorical scaffold — \"Tested separately, one set moves and the other does not; reported together... 
the finding speaks to a narrower phenomenon than the framing suggests\" — with semicolon-balanced antitheses and the schematic \"X does heavy work here\" / \"defensible only when\" hedging cadence typical of LLM critique prose, rather than the more idiosyncratic, citation-anchored compression of a real published Comment."}]},"address_format":{"runDate":"2026-06-23","finding1":"PARTIALLY ADDRESSED. The format transformation (flowing Comment) genuinely closed the NARRATIVE-COHERENCE gap: the titled-section template is gone and an independent refute-by-default audit upheld narrative at at_standard/approaching on all 4. But the REFERENCE-DENSITY gap is only improved, NOT closed: the rubric scorer over-credited it (caught by the cadence audit), which re-scored in-text references at below/well_below — the inline citations are front-loaded (cluster in early paragraphs, then critique paragraphs run citation-free) and the central study being critiqued is often not cited inline, far from the sustained 1-3-per-paragraph Comment norm. Net: structure fixed; citation density raised from ~zero but still below standard.","finding2":"NOT closed — blind discrimination stayed 4/4 engine caught (100%) even after the DEEPER structural de-fingerprinting PLUS the format change. The LLM-prose fingerprint is a MOVING TARGET: each de-fingerprinting pass relocates the tell (G69 -> decorative citations; G70 -> aphoristic 'concession-then-correction' epigrams, which judges flagged). Prompt-level intervention cannot reach indistinguishability; closing it needs fine-tuning (gated) or accepting the AI voice. Doubly evidenced (G69 + G70).","discrimination":{"correct":7,"total":7,"engineCaught":4,"engineN":4},"residualTells":[{"id":"brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","confidence":"high","tell":"The punchy, aphoristic closers (\"Believe the headline. 
Hold the mechanism story at arm's length.\" / \"the harder problem is not the headline at all\") combine with decorative, non-load-bearing citations — Galton (1886) for regression-to-the-mean and Polanyi (1966) for tacit knowledge are gestural \"name-drop the canonical origin\" moves typical of LLM critique rather than the working citation fabric of a real journal Comment, which would engage the cited methods substantively rather than invoking them as flourishes."},{"id":"the-rise-of-ai-sovereignty","confidence":"high","tell":"The argument marches through a rigid evaluative scaffold built on near-formulaic abstract antitheses (\"the distance between what it claims and what its design can warrant,\" \"The warrant thins elsewhere,\" \"its conclusions outrun their evidence,\" \"The normative payoff inherits this deficit\"), each paragraph executing the same claim-vs-evidence template with paired-citation parentheticals (Flyvbjerg, 2006; Gerring, 2004) deployed as generic methodological warrants rather than as engagement with a live scholarly conversation. Real top-journal Comments cite to contest or extend specific arguments; here citations function decoratively to certify abstract epistemological points (Entman on framing, Jasanoff & Kim on imaginaries), and the prose leans on signature LLM balance-and-antithesis cadence (\"Scoped to hypotheses... Stated as it stands\") that prizes symmetry over the idiosyncratic, paper-specific texture of genuine peer commentary."},{"id":"peng-copilot-developer-productivity","confidence":"high","tell":"The prose runs on a relentless template of \"concessive-then-correction\" aphorisms that collapse each point into a tidy epigram — \"the finding stands; it is simply bounded,\" \"code showed up faster... It says nothing about whether,\" \"I mean this structurally, not as an accusation\" — a rhythmic, self-balancing antithesis machine. 
Real journal Comments cite specific results, numbers, and authors with concrete attachment to the source paper, whereas here the citations are generic methodological touchstones (Glass 2003 on maintenance, Peng 2011 on reproducibility, Munafò et al. 2017 on replication) dropped in as decorative authority for self-evident claims rather than woven into a genuine argumentative thread. This citation fabric — invoking famous \"reproducibility/replication crisis\" references to backstop platitudes — is a hallmark of LLM-generated critique."},{"id":"the-cybernetic-teammate-a-field-experiment-on-gene","confidence":"high","tell":"The prose runs on a relentless engine of antithetical aphorism — \"most quotable exactly where its design can carry the least,\" \"cleanest result is also its smallest,\" \"softest evidence ends up carrying some of its boldest language,\" \"did it work here\" vs. \"will it work there\" — a balanced-reversal cadence stacked far denser than real Comments sustain. Paired with the thin, recycled citation fabric (the same three references — Deaton & Cartwright; Al-Ubaydli, List & Suskind — cycled to anchor generic external-validity boilerplate) and the conspicuously performed fairness moves (\"I don't want to discount that,\" \"To its credit,\" \"None of this is a charge\"), it reads as engine-generated rhetorical scaffolding rather than a real referee's substantive, source-rich argument."}],"auditNote":"Cadence audit (mechanism 5) independently re-verified finding 1 refute-by-default and found the rubric scorer INFLATED supporting_references on 4/4 renderings (claimed approaching/at_standard; audited below/well_below). narrative_coherence upheld. The audited bands are recorded; the verdict is corrected to PARTIALLY addressed. 
Inline-citation integrity separately confirmed deterministically: 0 fabricated inline cites (all map to the verified reference list).","caveat":"n=4 engine vs 3 real specimens, single judge; the real specimens are economics Comments while the engine targets are AI-topic, so a small part of the discrimination signal could be topic rather than register.","renderings":[{"id":"brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","abstract":"Brynjolfsson, Li, and Raymond's (2025) flagship 15% productivity finding is well identified, but the paper's more ambitious claims about durable learning, communication quality, and worker experience rest on rare natural-experiment variation and on outcomes manufactured by other unvalidated AI systems, warranting more cautious interpretation than the headline result.","prose":"The central productivity result in Generative AI at Work holds up. What the paper builds on top of it does not, at least not to the same degree. Access to a GPT-3-based assistant raised resolutions per hour by roughly 15 percent, and for this setting that number is well identified. It barely moves as fixed effects pile on, drifting from 23.9 to 15.2 percent, and it survives the staggered-adoption estimators built to recover causal effects when treatment timing is heterogeneous across many periods (Callaway and Sant'Anna, 2021; Sun and Abraham, 2021).\n\nThe skill gradient is also convincing on its own terms. Gains run as high as 30 to 36 percent for the lowest skill quintile and shrink to almost nothing for top agents, and the pattern repeats across five outcomes and two cross-cutting dimensions. The authors go looking for mechanical artifacts and mostly fail to find them. But skill here is measured from baseline performance, so the workers who gain the most are by construction the ones who scored worst at the start. Whatever part of a low baseline score was simply a bad stretch will look like improvement once treatment arrives. 
That leaves a regression-to-the-mean signature baked into the convergence pattern, the same trap that has dogged designs of this shape since Galton (1886) — and the authors' decision to call the result merely \"suggestive\" reads as a quiet acknowledgment of exactly this.\n\nThe harder problem is not the headline at all. Take the claim that workers durably learn. It hangs on agents staying fast during AI outages, which is an ingenious bit of natural experiment, except that outages are rare and noisy, the chat mix during them may not resemble normal operation, and the authors concede the estimates are imprecise. Then there are the communication and experience results, and here the worry deepens. English fluency improves; low-skill writing drifts toward high-skill writing; customer sentiment climbs half a standard deviation; escalations fall. None of these is observed directly. Each is the output of another machine-learning system — Gemini scoring fluency, SiEBERT classifying sentiment, embedding distances standing in for stylistic convergence — and the biases of those systems cannot be checked from outside the study, nor ruled out as correlated with the very treatment being evaluated. Reading the convergence as tacit knowledge flowing from strong performers to weak ones is an appealing story, and Polanyi (1966) gives it a name, but the convergence evidence underneath is once again, in the authors' own words, only suggestive.\n\nSet all of this on top of rollout timing that managers chose rather than randomized, and the measurement dependencies start to matter a great deal. The paper's most interesting conclusions about learning, communication, and worker welfare stand on far softer ground than its core estimate does. Believe the headline. Hold the mechanism story at arm's length.","referenceList":["Brantly Callaway and Pedro H. C. Sant'Anna (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics.","Francis Galton (1886). 
Regression Towards Mediocrity in Hereditary Stature. The Journal of the Anthropological Institute of Great Britain and Ireland.","Michael Polanyi (1966). The Tacit Dimension. Knowledge in Organisations.","Liyang Sun and Sarah Abraham (2021). Estimating Dynamic Treatment Effects in Event Studies With Heterogeneous Treatment Effects. SSRN Electronic Journal."],"afterDims":{"narrative_coherence":"at_standard","academic_register":"at_standard","compactness":"at_standard","hedging_calibration":"at_standard","supporting_references":"approaching","citation_grounding":"approaching"},"afterOverall":"approaching","afterMean":2.67,"auditedNarrative":"approaching","auditedReferences":"below","beforeStructured":{"narrative_coherence":"below","supporting_references":"well_below","overall":"below"}},{"id":"the-rise-of-ai-sovereignty","abstract":"A framework that pairs sociotechnical imaginaries with reflexive control to read authoritarian AI governance is conceptually generative, but its causal and typological conclusions outrun an evidentiary base built from the public statements of a single head of state.","prose":"The central difficulty is not the framework but the distance between what it claims and what its design can warrant. Asmolov pairs sociotechnical imaginaries with reflexive control, and this is a defensible theory-building move. Imaginaries give us the interpretive register: how AI is collectively envisioned (Jasanoff & Kim, 2009). Reflexive control names something different, the strategic register, the art of inducing an adversary to act in ways that serve the influencer's aims (Thomas, 2004). Holding these two registers together is the essay's real intellectual product. It motivates a reframing — away from AI-as-instrument-of-disinformation, toward AI-governance-as-contested-terrain — that addresses a precise gap. 
The existing literature has fixated on what generative AI does as a propaganda tool, leaving comparatively untouched the governance frames through which autocracies project influence (Gunitsky, 2015). Naming \"AI sovereignty\" as the trackable analytic object also helps, since it keeps the framework tethered to observable referents. And the reading of Russia as carrying Cold War nuclear logics into a new register of security and survival is the most credibly scoped empirical claim the essay offers.\n\nThe warrant thins elsewhere. The design rests on public statements by one head of state — one regime, one speaker — yet it is asked to underwrite typological claims about \"the authoritarian AI imaginary\" as such, and to say something about \"authoritarian diffusion\" more broadly. A single information-rich case is a respectable basis for concept formation and hypothesis generation (Flyvbjerg, 2006; Gerring, 2004). What it can license, though, are propositions put forward for further testing, not the categorical generalizations the abstract advances. So the three frames — AI as a tool of domination, Western AI as cultural threat, AI as guarantor of state survival — read most naturally as hypotheses. The apparatus that produces them is a framing analysis, and framing analysis characterizes a message; it does not measure whether anyone took the message up (Entman, 1993). That gap bears directly on the load-bearing claim that such imaginaries \"shape the cognitive environment\" of global policy debates. Studying what one speaker says can establish what was said. It cannot establish reception, nor causal influence on third parties.\n\nThe normative payoff inherits this deficit. Asmolov wants the reflexive-control lens to let analysts separate legitimate Global South concerns from authoritarian influence. But that promise presupposes the very discriminating criterion the single-case design has not been shown to supply. 
Once reflexive control is applied interpretively, any sovereignty claim can be redescribed as genuine or as manipulated, with nothing independent to break the tie. Scoped to hypotheses, the contribution is valuable. Stated as it stands, its conclusions outrun their evidence.","referenceList":["Entman, R. M. (1993). Framing: Toward clarification of a fractured paradigm. In Schlüsselwerke: Theorien (in) der Kommunikationswissenschaft.","Flyvbjerg, B. (2006). Five misunderstandings about case-study research. In Case Studies.","Gerring, J. (2004). What is a case study and what is it good for? In Case Studies.","Gunitsky, S. (2015). Corrupting the cyber-commons: Social media as a tool of autocratic stability. Perspectives on Politics.","Jasanoff, S., & Kim, S.-H. (2009). Containing the atom: Sociotechnical imaginaries and nuclear power in the United States and South Korea. Minerva.","Thomas, T. L. (2004). Russia's reflexive control theory and the military. The Journal of Slavic Military Studies."],"afterDims":{"narrative_coherence":"at_standard","academic_register":"at_standard","compactness":"at_standard","hedging_calibration":"at_standard","supporting_references":"at_standard","citation_grounding":"at_standard"},"afterOverall":"at_standard","afterMean":3,"auditedNarrative":"at_standard","auditedReferences":"approaching","beforeStructured":null},{"id":"peng-copilot-developer-productivity","abstract":"The widely cited \"55.8% faster\" effect of GitHub Copilot is real but narrower than its near-universal invocation suggests: it is an imprecise estimate from a single greenfield task, measures speed rather than quality, and rests on telemetry instrumented by the tool's own maker without a described replication package.","prose":"The figure that GitHub Copilot makes developers roughly 56% faster now circulates as shorthand for what AI does to programming. But the distance from that one experiment to \"developer productivity\" is longer than the citation count suggests. 
Start with what the study gets right. Because Copilot access was randomized, the design cleanly identifies the tool's causal effect on the task it measured, and on that narrow question the evidence holds up.\n\nWhat the headline number hides first is its own imprecision. The 56% is the midpoint of a confidence interval running from about 21% to 89%, drawn from 95 freelancers. An effect that could be modest or could be enormous gets reported as a single point, which fixes far more than the data do.\n\nThe deeper problem is external validity. Subjects wrote a single self-contained greenfield program in JavaScript, working alone. That happens to be the corner of software work that resembles professional practice least. Most real engineering is maintenance, reading code you didn't write, debugging inside large systems, and coordinating with other people, not composing fresh functions against a clean spec (Glass, 2003). Reading \"productivity\" off greenfield speed reaches well past what was measured.\n\nAnd speed is only part of the story even within the task. Saving 56% of the time tells us code showed up faster. It says nothing about whether that code is correct, maintainable, or secure. With no defect or rework outcomes recorded, a completion-time endpoint cannot see the hours that resurface later as correction, so faster finishing is not yet a net gain.\n\nThere is also the matter of who measured it. The outcome comes from telemetry the tool's own maker collected and instrumented, and the paper describes no replication package, which leaves the measurement pipeline closed to independent checking and the estimate closed to reanalysis. I mean this structurally, not as an accusation of bad faith. 
Reproducible computational research asks that the data and analysis code travel with the claim (Peng, 2011), and an empirical literature earns trust through routine replication rather than deference to one instrumented number (Munafò et al., 2017).\n\nSo the finding stands; it is simply bounded. The experiment licenses a clean causal claim about one tool on one kind of task. The jump to general developer productivity, and the confident repetition of a lone midpoint, go further than this evidence yet allows.","referenceList":["Glass, R. L. (2003). Facts and Fallacies of Software Engineering. The Journal of Object Technology.","Peng, R. D. (2011). Reproducible Research in Computational Science. Science.","Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour."],"afterDims":{"narrative_coherence":"at_standard","academic_register":"at_standard","compactness":"at_standard","hedging_calibration":"at_standard","supporting_references":"approaching","citation_grounding":"approaching"},"afterOverall":"approaching","afterMean":2.67,"auditedNarrative":"at_standard","auditedReferences":"well_below","beforeStructured":{"narrative_coherence":"approaching","supporting_references":"well_below","overall":"below"}},{"id":"the-cybernetic-teammate-a-field-experiment-on-gene","abstract":"A field experiment can credibly show that AI helped one professional match a two-person team at a single firm, but it cannot license the paper's headline claim that generative AI reshapes knowledge work or substitutes for a human teammate.","prose":"\"The Cybernetic Teammate\" is most quotable exactly where its design can carry the least. Its cleanest result is also its smallest. 
Random assignment within the firm pins down the contrast between AI-augmented individuals and unaugmented two-person teams, and randomization is what lets us read that contrast causally (Deaton & Cartwright, 2018). I don't want to discount that; the internal-validity claim is earned. The trouble starts when the paper's framing outruns the contrast it actually identifies. One consumer-goods firm, one class of product-innovation tasks, one deployment window — and these are asked to stand in for \"knowledge work\" and for the idea that AI can take a teammate's place. To its credit the paper softens the move with words like \"suggest,\" so this is overreach of a mild kind. But the rhetorical weight has shifted onto the general claim and off the specific estimate.\n\nGoing from a single site to a population is not a question of how confident we are. It is a question of what kind of evidence we have. An internally valid treatment effect tells you nothing on its own about what the effect would be in another firm, another task, or another workforce, and the reason is that the mechanisms producing the effect are themselves what change at scale: the structure of these particular tasks, the partial-equilibrium novelty of a tool nobody has gotten used to yet, the selected sample, the organization doing the implementing (Al-Ubaydli, List & Suskind, 2017). Randomization answers \"did it work here.\" It does not answer \"will it work there\" — and that gap between a well-estimated local effect and a transportable one is where strong field-experimental claims tend to break (Deaton & Cartwright, 2018).\n\nThe teammate-substitution finding runs into trouble from another angle. The claim that AI partly filled a social and motivational role is genuinely interesting. But it rests on self-reported affect gathered shortly after a conspicuous new tool arrived, which leaves it exposed to novelty and demand effects in a way the output measures are not. 
So the paper's softest evidence ends up carrying some of its boldest language.\n\nNone of this is a charge against the experiment. The ask is narrower: bring the abstract's reach back down to what the design can hold up, which is a credible local demonstration rather than a general law of AI and teamwork (Al-Ubaydli, List & Suskind, 2017).","referenceList":["Al-Ubaydli, O., List, J. A., & Suskind, D. L. (2017). What Can We Learn from Experiments? Understanding the Threats to the Scalability of Experimental Results. American Economic Review, 107(5), 282-286.","Deaton, A., & Cartwright, N. (2018). Understanding and Misunderstanding Randomized Controlled Trials. Social Science & Medicine, 210, 2-21."],"afterDims":{"narrative_coherence":"at_standard","academic_register":"approaching","compactness":"at_standard","hedging_calibration":"at_standard","supporting_references":"approaching","citation_grounding":"at_standard"},"afterOverall":"approaching","afterMean":2.67,"auditedNarrative":"at_standard","auditedReferences":"below","beforeStructured":{"narrative_coherence":"approaching","supporting_references":"well_below","overall":"below"}}],"format_gap_addressed":false,"fingerprint_closed":false},"self_improvement":{"lessons":[{"id":"L1","lesson":"Lead with ONE narrowly-scoped objection; do not bundle every concern into a single sprawling sentence.","counters":"bundling"},{"id":"L2","lesson":"Name a method or prior work only where it does load-bearing work in the argument (and would carry a citation); do not rattle off a checklist of names.","counters":"checklist"},{"id":"L3","lesson":"Make each point plainly once; avoid aphoristic antitheses and \"gotcha\" framings.","counters":"gotcha"},{"id":"L4","lesson":"Do not write N-noun summative parallel verdicts; state the judgement concretely and specifically.","counters":"nnoun"},{"id":"L5","lesson":"Ground objections in external methodology or literature, not just the target's own wording or an internal 
contradiction.","counters":"self_referential"}],"held_out_ab":{"runDate":"2026-06-22","rows":[{"id":"the-rise-of-ai-sovereignty","baseline":1,"lessons":1,"delta":0},{"id":"unraveling-generative-ai-from-a-human-intelligence","baseline":1,"lessons":1,"delta":0},{"id":"farach-scaffolding-human-ai-collaboration","baseline":1,"lessons":1,"delta":0}],"baselineMeanTells":1,"lessonsMeanTells":1,"passed":false,"anyRegression":false,"verdict":"NOT ACTIVATED. On 3 held-out fresh targets the lessons did not reduce the mean register-tell count (1.00 vs 1.00): they suppressed the targeted tell (bundling, gotcha) but a different tell surfaced (self_referential, nnoun) — whack-a-mole, not elimination. A single-pass generation self-check does not move the engine to prose parity. The lessons stand as DIAGNOSED failure modes + a cautionary held-out record; a revision-pass v2 (generate, then revise to remove tells) is the open follow-up."},"active":false,"note":"Single-pass presentation lessons activate only if they beat a no-lessons baseline on a held-out A/B. The v1 single-pass set FAILED (no tell reduction). The v2 REVISION pass PASSED.","revision_pass_v2":{"runDate":"2026-06-22","mechanism":"revision pass (generate baseline -> rewrite to strip the 5 register tells)","rows":[{"id":"the-rise-of-ai-sovereignty","baseline":2,"revised":1,"delta":-1},{"id":"unraveling-generative-ai-from-a-human-intelligence","baseline":1,"revised":0,"delta":-1},{"id":"farach-scaffolding-human-ai-collaboration","baseline":1,"revised":1,"delta":0}],"baselineMeanTells":1.3333333333333333,"revisedMeanTells":0.6666666666666666,"passed":true,"anyRegression":false,"verdict":"ACTIVATED. On 3 held-out fresh targets the revision pass reduced the mean register-tell count from 1.33 to 0.67 with no per-target regression (deltas -1,-1,0) — where the single-pass self-check (G66) did not move it (1.00 vs 1.00). 
Gate metric is the concretely defined tell count; a secondary 'reads-like-a-Comment' rating came back ambiguous/uniform and is NOT used as the gate. The revision pass is validated and available to the generation lane; it is not yet wired in by default. n=3, single judge: a proxy, honestly bounded."},"revision_pass_active":true}}