Comment on "Generative AI at Work"

Item: Generative AI at Work
Author: Critical AI

Post-publication Comment · Critical AI

Critical AI · published 2026-06-15 · v1.0 · CRIT-000002

Concerning: Erik Brynjolfsson, Danielle Li, Lindsey R. Raymond · The Quarterly Journal of Economics (Oxford University Press) · 2025-02-04

Severity: Moderate · Confidence: High · Tier S · Open-access full text · Empirical

Labour markets · Innovation, productivity & competition · Human–AI interaction · Inequality, bias & fairness · Knowledge production

Why this paper was selected

This is among the most cited and policy-influential empirical studies of generative AI's effect on real-world labour productivity. It studies a GPT-3-based assistant deployed to 5,172 customer-support agents in a Fortune 500 firm, directly addressing AI-and-labour-markets, the skill distribution of AI gains (novices benefit most), human-AI interaction (adherence/learning), and knowledge production (tacit-knowledge diffusion and the long-run model-training feedback loop). Its findings are routinely cited in AI policy and labour debates, making post-publication scrutiny high-value.

AI/AGI centrality 5/5 · societal relevance 5/5 · source-journal note: The Quarterly Journal of Economics is one of the canonical 'top-5' economics journals and ranks 1st on the RePEc IDEAS simple 10-year impact-factor list (IF 73.019, ahead of the American Economic Review at 52.053). It is a flagship, rigorously refereed general-interest economics journal published by Oxford University Press for the Harvard University Department of Economics (editors named in the paper: Lawrence Katz and Andrei Shleifer). Tier S.

Summary

This paper asks a question many people now care about: what happens to ordinary workers when a generative-AI tool is dropped into their daily job? The authors study 5,172 customer-support agents at a Fortune 500 software company. Most agents are based in the Philippines and answer technical questions from U.S. small-business owners over chat. The company rolled out an AI assistant built on OpenAI's GPT-3 that watches each live conversation and suggests, in real time, how the agent might reply, plus links to internal help documents. The agent stays in charge and can ignore the suggestions. Because the firm could only train and license a limited number of agents at a time, different agents got access in different months, which lets the authors compare the same workers before and after they got the tool, and against workers who never got it.

The headline result is a 15% average increase in productivity, measured as customer problems resolved per hour. The more striking result is who benefits. The gains go overwhelmingly to the least experienced and lowest-skilled agents, who improve by roughly 30-36%, while the most experienced, highest-skilled agents barely change, and on a couple of quality measures get slightly worse. New agents with the tool reach the productivity level that previously took several months of experience in about two months. This is notable because earlier waves of computing tended to help skilled workers most; here the pattern flips.

The authors push further into why. Agents who follow the AI's suggestions more closely gain more, and skeptical agents gradually start trusting it. Using accidental software outages, when the AI suddenly goes dark, they show that agents who had used the tool for a while still work faster than their pre-AI selves, which they read as genuine learning rather than mere dependence. The biggest gains appear not on the most common problems (which even novices already know) nor on the rarest (where the AI lacks training data), but on moderately uncommon problems. Using other AI models, they also find agents write more fluent, more 'native-sounding' English after adoption, especially overseas agents, and that low-skill agents' writing drifts to resemble high-skill agents' writing. Finally, customers behave better: their messages are warmer and they ask for a manager about 25% less often, and worker turnover falls, mostly among newer agents.

How much should we trust this? The study has real strengths. It is a large, granular, real-workplace dataset rather than a lab experiment; the productivity effect appears immediately and persists; and the authors stress-test it with several modern statistical estimators, an instrumental-variable approach, and a small embedded randomized pilot, all of which point the same way. They are unusually careful about caveats: they repeatedly say this is one tool, one firm, one job, that they cannot see wages or company-wide employment, and that the productivity numbers are short-to-medium-run.

The main reasons for caution are not errors so much as limits. Managers, not a clean lottery, decided who got the AI and when (the embedded pilot covers only about 50 workers and the authors lack data on its control group), so selection cannot be fully ruled out, though the IV and event-study checks reduce the worry. Several of the most interesting mechanisms (the outage-based 'learning' result and the English-fluency and writing-convergence findings) are explicitly described by the authors as suggestive: outages are rare and the resulting estimates noisy, and the fluency and sentiment scores are produced by other AI models (Gemini, SiEBERT) whose own biases are hard to audit. Crucially, the underlying chat data are not shared; only replication code is posted on the Harvard Dataverse, so independent researchers cannot re-run the analysis on the raw data. The firm, the AI vendor, and even individual pay are undisclosed, and the study period is 2020-2021 (an early GPT-3 system); all of this limits external scrutiny and generalization. In short: a careful, important, appropriately hedged study whose central productivity finding is well supported, but whose more speculative mechanisms and whose non-shared underlying data warrant the measured tone the authors themselves adopt.

Central claims & evidence map

Claim	Type	Evidence offered	Support	Overclaiming	Main weakness
Access to the GPT-3-based AI assistant increased worker productivity, measured as resolutions per hour, by 15% on average (0.30 resolutions/hour off a pretreatment mean of ~1.97).	Causal	Two-way fixed-effects difference-in-differences with agent, year-month, location and agent-tenure fixed effects (Table II, col. 3: 0.301, SE 0.0329, p<0.01); immediate, persistent Sun-Abraham event-study (Figure II); robustness across de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna, Borusyak et al. estimators and an IV using team/office first-adoption dates.	Strong	None	Identification rests on non-random, manager-determined rollout timing; parallel-trends and no-anticipation are assumed rather than guaranteed, and the embedded pilot (~50 workers, no control-group data) is too small to anchor the headline on its own.
Gains accrue disproportionately to less-experienced and lower-skilled agents (up to ~30-36% RPH for the lowest quintile), while the most skilled/experienced agents see negligible speed gains and small quality declines.	Descriptive	Heterogeneity by skill quintile (Figure III: lowest quintile +0.5 RPH, or 36%) and tenure (Figure IV), each controlling for the other dimension; experience-curve plot (Figure V) showing treated novices reach veteran productivity in ~2 months; mean-reversion check (Online Appendix Figure A.VII).	Strong	Minor	Skill is measured by a pre-period performance index, so 'low skill' partly reflects transient low performance; the mean-reversion check is graphical/tercile-based rather than a formal test, leaving some residual regression-to-the-mean concern for the magnitude (not the sign).
Productivity gains partly reflect durable worker learning rather than mere reliance on the AI, evidenced by agents continuing to work faster during AI outages.	Causal	Chat-level event studies restricting to software-outage windows (Figure VII): exposed agents handle chats 15-25% faster than their pre-AI baseline even when AI is unavailable, with the effect growing with months of prior exposure and concentrated among high-adherence agents.	Moderate	Minor	Outages are rare, the estimates are noisy, and chats handled during outages may differ in composition from non-outage chats (the authors acknowledge this); 'learning' is also confounded with selection into adherence. The conclusion's framing that gains in part reflect durable worker learning is reasonable but rests on the weakest-powered evidence in the paper.
AI assistance improved agents' written English fluency/comprehensibility (especially for Philippines-based agents) and caused low-skill agents' writing to converge toward high-skill agents' writing.	Descriptive	Gemini-scored comprehensibility and 'native fluency' (1-5 scales) event studies (Figure IX); cosine-similarity/textual-embedding convergence analysis (Online Appendix Figure A.XVI) showing high-low skill similarity rising 0.55->0.61.	Weak	Minor	Outcomes are generated by other LLMs (Gemini) and embedding models whose scoring biases are not independently validatable; 'native fluency' as an outcome (defined via the Interagency Language Roundtable 'functionally native' standard) is normatively loaded; convergence may partly reflect changing chat topics rather than worker style. These are measurement-dependent, not behaviorally clean, outcomes.
AI assistance improved the experience of work: customer sentiment rose ~0.5 SD, requests to speak to a manager fell ~25%, and worker attrition fell (~40% among agents with <6 months tenure).	Causal	SiEBERT sentiment DiD (Table IV: customer sentiment +0.177, p<0.01, equivalent to about half a standard deviation; manager requests -0.00875, p<0.01, ~25% off a ~6% baseline); attrition analysis (Online Appendix Figure A.XVIII: ~10pp = 40% off a 25% baseline for <6-month agents).	Moderate	Minor	Sentiment is an LLM-derived proxy (SiEBERT, a RoBERTa checkpoint), not a validated customer outcome; the attrition analysis cannot include agent fixed effects because attrition happens once per worker, so the authors explicitly warn it should be taken with more caution than the main productivity results.

Per-claim assessment

C1. Access to the GPT-3-based AI assistant increased worker productivity, measured as resolutions per hour, by 15% on average (0.30 resolutions/hour off a pretreatment mean of ~1.97).
The flagship effect is well identified for this setting. The estimate is stable as fixed effects are added (falling from 23.9% to 15.2%), survives multiple modern staggered-adoption estimators and an IV that addresses manager selection, and shows an immediate post-adoption jump with flat pre-trends. The 15% headline is the conservative, fully-controlled number, not the raw gap.
C2. Gains accrue disproportionately to less-experienced and lower-skilled agents (up to ~30-36% RPH for the lowest quintile), while the most skilled/experienced agents see negligible speed gains and small quality declines.
The monotone pattern of larger gains for less-skilled and less-experienced agents is internally consistent across five outcomes and two cross-cutting dimensions, and the authors directly test and largely rule out mechanical mean reversion. This is the paper's most novel and robust contrast with prior skill-biased-technical-change literature.
C3. Productivity gains partly reflect durable worker learning rather than mere reliance on the AI, evidenced by agents continuing to work faster during AI outages.
A clever natural experiment, and the directional pattern (growing with exposure, concentrated in adherers) is consistent with learning. The authors explicitly flag it as noisy and note that outages are rare.
C4. AI assistance improved agents' written English fluency/comprehensibility (especially for Philippines-based agents) and caused low-skill agents' writing to converge toward high-skill agents' writing.
Directionally plausible and consistent with the tacit-knowledge-diffusion story. The authors themselves label the convergence analysis 'only suggestive' and caution it can reflect customer-driven topic shifts.
C5. AI assistance improved the experience of work: customer sentiment rose ~0.5 SD, requests to speak to a manager fell ~25%, and worker attrition fell (~40% among agents with <6 months tenure).
Sentiment and escalation effects are precisely estimated and align with the productivity story. The authors appropriately caution that the attrition result is weaker.

Scorecard

AI/AGI contribution: 5.0 / 5

Evidentiary support: 4.0 / 5

Methodological risk: 2.5 / 5

Overclaiming: 1.5 / 5

Reproducibility / auditability: 2.0 / 5

Societal-impact relevance: 5.0 / 5

Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.

What the paper does

The paper studies the staggered rollout of a GPT-3-based conversational assistant to 5,172 customer-support agents at a Fortune 500 firm that sells business-process software to small and medium U.S. businesses. The tool monitors live chats and offers real-time response suggestions; agents may ignore it. Identification leans on individual-level differences in adoption timing (the rollout ran primarily from fall 2020 to winter 2021), with a small randomized pilot (around August 2020, about 50 workers) as supporting evidence.
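
To make the identification idea concrete, the following is a minimal sketch of a two-way fixed-effects difference-in-differences estimate on a synthetic, balanced agent-by-month panel with staggered adoption. Everything in it is an assumption made for illustration: the data are simulated, the function names are hypothetical, and only agent and month effects are removed, whereas the specification described in this Comment also includes location and tenure fixed effects. It is not the authors' code.

```python
import random

def simulate_panel(n_agents=200, n_months=12, effect=0.30, seed=0):
    """Synthetic balanced panel: rph[i][t] and treated[i][t] (all values hypothetical)."""
    rng = random.Random(seed)
    month_fe = [rng.gauss(0.0, 0.1) for _ in range(n_months)]
    rph, treated = [], []
    for _ in range(n_agents):
        agent_fe = rng.gauss(1.97, 0.4)
        # Staggered adoption in months 3..9; about a quarter of agents never adopt.
        adopt = n_months if rng.random() < 0.25 else rng.randint(3, 9)
        d_row = [1.0 if t >= adopt else 0.0 for t in range(n_months)]
        y_row = [agent_fe + month_fe[t] + effect * d_row[t] + rng.gauss(0.0, 0.3)
                 for t in range(n_months)]
        treated.append(d_row)
        rph.append(y_row)
    return rph, treated

def double_demean(x):
    """Subtract agent and month means; exact two-way within transform for a balanced panel."""
    n, t = len(x), len(x[0])
    row = [sum(r) / t for r in x]
    col = [sum(x[i][j] for i in range(n)) / n for j in range(t)]
    grand = sum(row) / n
    return [[x[i][j] - row[i] - col[j] + grand for j in range(t)] for i in range(n)]

def twfe_estimate(y, d):
    """OLS slope of double-demeaned outcome on double-demeaned treatment."""
    y_t, d_t = double_demean(y), double_demean(d)
    num = sum(a * b for ry, rd in zip(y_t, d_t) for a, b in zip(ry, rd))
    den = sum(b * b for rd in d_t for b in rd)
    return num / den

rph, treated = simulate_panel()
estimate = twfe_estimate(rph, treated)
print(f"TWFE estimate: {estimate:.3f} (simulated true effect 0.30)")
```

The simulated effect is the same for every agent and month, which is the case in which this simple estimator behaves well. With staggered adoption and effects that vary across cohorts or over time, two-way fixed effects can be biased, which is why robustness to the heterogeneity-robust estimators matters.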

The headline productivity result

Access to AI raises resolutions per hour by 0.30 (15.2%) off a pretreatment mean of 1.97 (Table II, col. 3). The estimate falls from 23.9% to 15.2% as agent, year-month, location and tenure fixed effects are added, and survives Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna and Borusyak et al. estimators plus an IV using team/office first-adoption dates. This is well identified for the setting.
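
The headline arithmetic can be checked from the figures quoted in this Comment. The sketch below converts the coefficient to a percentage of the pretreatment mean and adds an approximate 95% interval. The interval is an assumption of this sketch (a normal approximation that treats the baseline mean as fixed), not a figure reported in the Comment; because the inputs are rounded, the point estimate matches the quoted 15.2% only to within rounding.

```python
COEF = 0.301      # Table II, col. 3, as quoted in this Comment
SE = 0.0329       # standard error, as quoted in this Comment
BASELINE = 1.97   # pretreatment mean resolutions per hour (rounded)

def percent_effect(coef, baseline):
    """Effect expressed as a percentage of the baseline mean."""
    return 100.0 * coef / baseline

def normal_ci(coef, se, z=1.96):
    """Approximate 95% interval under a normal approximation (assumption)."""
    return coef - z * se, coef + z * se

low, high = normal_ci(COEF, SE)
point_pct = percent_effect(COEF, BASELINE)
low_pct = percent_effect(low, BASELINE)
high_pct = percent_effect(high, BASELINE)
print(f"{point_pct:.1f}% (approx. 95% CI {low_pct:.1f}% to {high_pct:.1f}%)")
```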

Who benefits

Gains concentrate among novice and low-skill agents: the lowest skill quintile gains +0.5 RPH (36%), while the most skilled see no significant speed gain and small quality declines. Treated novices reach in roughly two months the productivity level that previously took several months of experience. The authors test and largely rule out mechanical mean reversion. This skill-leveling pattern is the paper's most novel contribution against the skill-biased-technical-change literature.
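
Why mean reversion needs checking at all can be illustrated with a small simulation. In the sketch below there is no treatment whatsoever, yet agents sorted into quintiles by a noisy pre-period measure appear to "improve" at the bottom and "decline" at the top. All parameters are hypothetical and chosen only to make the mechanism visible; the sketch says nothing about the size of any such bias in the paper.

```python
import random
from statistics import mean

def simulate_changes(n_agents=5000, seed=0):
    """Return (pre, post - pre) pairs when nothing changes between periods."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_agents):
        skill = rng.gauss(2.0, 0.4)          # persistent productivity (hypothetical)
        pre = skill + rng.gauss(0.0, 0.4)    # noisy pre-period measurement
        post = skill + rng.gauss(0.0, 0.4)   # noisy post-period measurement, no treatment
        records.append((pre, post - pre))
    return records

def mean_change_by_quintile(records):
    """Sort agents by pre-period performance and average the change in each quintile."""
    ranked = sorted(records, key=lambda r: r[0])
    size = len(ranked) // 5
    return [mean(change for _, change in ranked[q * size:(q + 1) * size])
            for q in range(5)]

changes = mean_change_by_quintile(simulate_changes())
for q, c in enumerate(changes, start=1):
    print(f"quintile {q}: mean change {c:+.3f}")
```

A treated-versus-untreated comparison within each quintile removes this artefact only if both groups are classified on the same noisy measure, which is why the magnitude of the low-skill gain is more exposed to this concern than its sign.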

Mechanisms: adherence, learning, and rare problems

Higher adherence predicts larger gains, and adherence rises over time. Using rare software outages, exposed agents still work 15-25% faster than their pre-AI baseline, which the authors read as durable learning while flagging the estimates as noisy. Gains are largest for moderately rare problems, where humans have less baseline experience but the system still has adequate training data.

Language and convergence effects

Gemini-scored comprehensibility and 'native fluency' rise, more so for Philippines-based agents, and textual cosine similarity between high- and low-skill agents climbs from 0.55 to 0.61. The authors explicitly label the convergence analysis 'only suggestive,' noting it can reflect customer-driven topic shifts. These outcomes are LLM-generated and measurement-dependent.
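
For readers unfamiliar with the convergence metric, the sketch below shows how cosine similarity between two groups' text embeddings might be computed. The three-dimensional vectors are toy values, and comparing group centroids is an assumption of this sketch: the Comment does not state whether the paper compares centroids or averages pairwise similarities, and the toy numbers are not meant to reproduce the reported 0.55 and 0.61.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy embeddings (hypothetical): two messages per group.
high_skill = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]
low_skill_before = [[0.2, 0.9, 0.1], [0.3, 0.8, 0.3]]
low_skill_after = [[0.5, 0.6, 0.1], [0.6, 0.5, 0.2]]

before = cosine_similarity(centroid(high_skill), centroid(low_skill_before))
after = cosine_similarity(centroid(high_skill), centroid(low_skill_after))
print(f"similarity before: {before:.2f}, after: {after:.2f}")
```

Because the metric depends only on where the embedding model places each message, a shift in chat topics moves the similarity just as a shift in writing style would, which is the confound flagged above.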

Experience of work

Customer sentiment rises 0.177 points (~0.5 SD; SiEBERT), requests to speak to a manager fall ~25% off a ~6% baseline, and attrition falls ~40% off a 25% baseline among agents with under six months tenure. The attrition analysis omits agent fixed effects (attrition occurs once per worker), and the authors flag it as weaker than the main results.

Limits and external validity

The setting is one tool, one firm, one job. The firm, the AI vendor, and individual pay are undisclosed; the period is an early (2020-2021) GPT-3 system. Replication code is posted on the Harvard Dataverse, but the raw chat data are not shared, so the analysis cannot be independently re-run on the underlying data. The randomized pilot is small and lacks control-group data.

Overall appraisal

A careful, important, and appropriately hedged study. The central productivity finding and the heterogeneity favouring novices are strongly supported; the learning, fluency/convergence, and sentiment mechanisms are more speculative and rest on LLM-derived or low-powered evidence, as the authors themselves acknowledge. The main residual concerns are non-random rollout timing and the inability of outsiders to audit the proprietary, non-shared underlying data.

Strongest critique

The most consequential causal claims beyond the headline (durable 'learning' from outages, and the customer and communication effects) depend either on very low-powered natural-experiment variation (rare, noisy outages whose chat composition may differ) or on outcomes manufactured by other AI models (Gemini fluency scores, SiEBERT sentiment, embedding-based convergence) whose biases cannot be independently validated. Combined with manager-determined, non-random rollout timing and underlying data that outsiders cannot access, this makes several of the paper's most cited secondary findings less robust than their prominence in policy debate implies.

Strongest fair defence

The headline 15% productivity effect is conservatively estimated, stable as controls are added, robust across four modern staggered-adoption estimators and an IV addressing manager selection, and shows an immediate jump with flat pre-trends. The authors are unusually explicit about every limitation (labeling the convergence analysis 'only suggestive,' flagging the outage estimates as noisy, and cautioning that the attrition result is weaker), so the paper's claims are calibrated to its evidence rather than overclaimed.

Conclusion

A genuinely important, well-executed empirical study whose central productivity and heterogeneity findings are well supported and whose secondary mechanisms are appropriately hedged by the authors. The principal caveats are non-random rollout timing, reliance on LLM-derived secondary outcomes, and proprietary underlying data that cannot be independently audited. Severity moderate; publish.

Reply from the authors

Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.

Reply: not yet invited. No reply has been received for publication.

The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.

Automated re-evaluation after reply: Authors may reply at any time, and replies are published alongside the Comment. A reply flagging a factual error triggers automated re-evaluation and a versioned correction. This critique addresses claims and methods only, never the authors.

References

Every external source this Comment cites, each with a verified link. 0 fabricated.

Works cited

Supporting literature this Comment’s claims rest on. Each entry was Crossref-verified to exist and grounded — checked to genuinely support the specific claim it is cited for (not padding) by the verified-reference apparatus.

Brantly Callaway and Pedro H. C. Sant'Anna (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2020.12.001 · ✓ grounds C1
Liyang Sun and Sarah Abraham (2021). Estimating Dynamic Treatment Effects in Event Studies With Heterogeneous Treatment Effects. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3158747 · ✓ grounds C1
Francis Galton (1886). Regression Towards Mediocrity in Hereditary Stature. The Journal of the Anthropological Institute of Great Britain and Ireland. https://doi.org/10.2307/2841583 · ✓ grounds C2
Michael Polanyi (1966). The Tacit Dimension. Knowledge in Organisations. https://doi.org/10.1016/b978-0-7506-9718-7.50010-x · ✓ grounds C4

Source-grounding attestation

✓ attested in-app · grounding: checked sources

✓ Passes the publication validator: no errors
✓ Zero fabricated citations: 0 fabricated
✓ Severity within the access-basis cap: severity "moderate" ≤ cap "high" for open_access
✓ External citations verified: 6/6 checked sources verified

Read at full text; grounding rests on the in-app citation-verification ledger (each external source URL checked).

Independent faithfulness review

A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.

✓ Faithful · 0/2 reviewers sustained a concern · source retrieved

Two adversarial refuters independently retrieved the real source through multiple channels — including the full published QJE PDF (the version this critique was written from), the NBER working paper, and arXiv v2 — and pressed every load-bearing claim for overreach (refuter 1) and mischaracterization (refuter 2). Neither sustained a misreading. Every headline magnitude checks out against the published text: the conservative fully-controlled 15% (0.301 off a 1.97 pretreatment mean), the 30-36% low-skill band with negligible/slightly-negative effects for top performers, the AI-outage durable-learning evidence, the 0.55->0.61 writing convergence, and the customer-experience results (+0.177 / ~0.5 SD sentiment, ~25% fewer manager requests, ~40% lower attrition off a 25% baseline). Crucially, the critique consistently rates support AT or BELOW the paper's own evidence (C3 "moderate," C4 "weak"), preserves the authors' qualifiers ("suggestive," "noisy," "more caution"), and foregrounds the exact limitations the authors themselves concede — non-random manager-determined rollout, the ~50-worker pilot with no control group, and attrition that cannot carry agent fixed effects. The only discrepancy either refuter surfaced — Gemini-scored fluency and ILR "native fluency" framing not appearing in the working paper — is version drift confirmed present in the published QJE source, not a misread, and the critique's underlying measurement-skepticism point holds in both versions. Verdict: faithful.

Version & correction history

Version	Date	Change
v1.0	2026-06-15	Initial publication.

No silent substantive corrections — every change is versioned and visible.

How to cite this Comment

Critical AI. Comment on “Generative AI at Work” (Erik Brynjolfsson et al., The Quarterly Journal of Economics, 2025). Critical AI; 2026. https://policywindow.org/critique/c/brynjolfsson-li-raymond-generative-ai-at-work-qje-2025

A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.

Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/brynjolfsson-li-raymond-generative-ai-at-work-qje-2025/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique brynjolfsson-li-raymond-generative-ai-at-work-qje-2025 --live.

Content fingerprint 09a310c4b58fd20d (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.
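
How a content-addressed fingerprint exposes silent edits can be sketched as follows. The hash function (SHA-256), the whitespace normalization, and the truncation to 16 hexadecimal characters are all assumptions of this sketch; the site's actual scheme is not described here, so this code is not expected to reproduce the fingerprint shown above.

```python
import hashlib

def content_fingerprint(text: str, length: int = 16) -> str:
    """Hypothetical fingerprint: truncated SHA-256 of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]

original = "Severity moderate; publish."
reflowed = "Severity   moderate;\npublish."   # layout change only
edited = "Severity high; publish."            # substantive change

print(content_fingerprint(original))
print(content_fingerprint(original) == content_fingerprint(reflowed))
print(content_fingerprint(original) == content_fingerprint(edited))
```

Normalizing whitespace before hashing is one way to keep layout-only changes from altering the fingerprint while any change to the wording still does.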