Post-publication Comment · Critical AI
Comment on “Generative AI at Work”
Critical AI · published 2026-06-15 · v1.0 · CRIT-000002
Concerning: Erik Brynjolfsson, Danielle Li, Lindsey R. Raymond · The Quarterly Journal of Economics (Oxford University Press) · 2025-02-04
Why this paper was selected
This is among the most cited and policy-influential empirical studies of generative AI's effect on real-world labour productivity. It studies a GPT-3-based assistant deployed to 5,172 customer-support agents in a Fortune 500 firm, directly addressing AI-and-labour-markets, the skill distribution of AI gains (novices benefit most), human-AI interaction (adherence/learning), and knowledge production (tacit-knowledge diffusion and the long-run model-training feedback loop). Its findings are routinely cited in AI policy and labour debates, making post-publication scrutiny high-value.
AI/AGI centrality 5/5 · societal relevance 5/5 · source-journal note: The Quarterly Journal of Economics is one of the canonical 'top-5' economics journals and ranks 1st on the RePEc IDEAS simple 10-year impact-factor list (IF 73.019, ahead of the American Economic Review at 52.053). It is a flagship, rigorously refereed general-interest economics journal published by Oxford University Press for the Harvard University Department of Economics (editors named in the paper: Lawrence Katz and Andrei Shleifer). Tier S.
Summary
This paper asks a question many people now care about: what happens to ordinary workers when a generative-AI tool is dropped into their daily job? The authors study 5,172 customer-support agents at a Fortune 500 software company. Most agents are based in the Philippines and answer technical questions from U.S. small-business owners over chat. The company rolled out an AI assistant built on OpenAI's GPT-3 that watches each live conversation and suggests, in real time, how the agent might reply, plus links to internal help documents. The agent stays in charge and can ignore the suggestions. Because the firm could only train and license a limited number of agents at a time, different agents got access in different months, which lets the authors compare the same workers before and after they got the tool, and against workers who never got it.
The headline result is a 15% average increase in productivity, measured as customer problems resolved per hour. The more striking result is who benefits. The gains go overwhelmingly to the least experienced and lowest-skilled agents, who improve by roughly 30-36%, while the most experienced, highest-skilled agents barely change, and on a couple of quality measures get slightly worse. With the tool, new agents reach in about two months the productivity level that previously took several months of experience. This is notable because earlier waves of computing tended to help skilled workers most; here the pattern flips.
The authors push further into why. Agents who follow the AI's suggestions more closely gain more, and skeptical agents gradually start trusting it. Using accidental software outages, when the AI suddenly goes dark, they show that agents who had used the tool for a while still work faster than their pre-AI selves, which they read as genuine learning rather than mere dependence. The biggest gains appear not on the most common problems (which even novices already know) nor on the rarest (where the AI lacks training data), but on moderately uncommon problems. Using other AI models, they also find agents write more fluent, more 'native-sounding' English after adoption, especially overseas agents, and that low-skill agents' writing drifts to resemble high-skill agents' writing. Finally, customers behave better: their messages are warmer and they ask for a manager about 25% less often, and worker turnover falls, mostly among newer agents.
How much should we trust this? The study has real strengths. It is a large, granular, real-workplace dataset rather than a lab experiment; the productivity effect appears immediately and persists; and the authors stress-test it with several modern statistical estimators, an instrumental-variable approach, and a small embedded randomized pilot, all of which point the same way. They are unusually careful about caveats: they repeatedly say this is one tool, one firm, one job, that they cannot see wages or company-wide employment, and that the productivity numbers are short-to-medium-run.
The main reasons for caution are limits rather than errors. Who got the AI and when was decided by managers, not by a clean lottery (the embedded pilot covers only about 50 workers and the authors lack data on its control group), so selection cannot be fully ruled out, though the IV and event-study checks reduce the worry. Several of the most interesting mechanisms (the outage-based 'learning' result and the English-fluency and writing-convergence findings) are explicitly described by the authors as suggestive: outages are rare and noisy, and the fluency and sentiment scores are produced by other AI models (Gemini, SiEBERT) whose own biases are hard to audit. Crucially, the underlying chat data are not shared; only replication code is posted on the Harvard Dataverse, so independent researchers cannot re-run the analysis on the raw data. The firm, the AI vendor, and even individual pay are undisclosed, and the study period is 2020-2021 (an early GPT-3 system), all of which limit external scrutiny and generalization. In short: a careful, important, appropriately hedged study whose central productivity finding is well supported, but whose more speculative mechanisms and non-shared underlying data warrant the measured tone the authors themselves adopt.
Central claims & evidence map
| Claim | Type | Evidence offered | Support | Overclaiming | Main weakness |
|---|---|---|---|---|---|
| Access to the GPT-3-based AI assistant increased worker productivity, measured as resolutions per hour, by 15% on average (0.30 resolutions/hour off a pretreatment mean of ~1.97). | Causal | Two-way fixed-effects difference-in-differences with agent, year-month, location and agent-tenure fixed effects (Table II, col. 3: 0.301, SE 0.0329, p<0.01); immediate, persistent Sun-Abraham event-study (Figure II); robustness across de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna, Borusyak et al. estimators and an IV using team/office first-adoption dates. | Strong | None | Identification rests on non-random, manager-determined rollout timing; parallel-trends and no-anticipation are assumed rather than guaranteed, and the embedded pilot (~50 workers, no control-group data) is too small to anchor the headline on its own. |
| Gains accrue disproportionately to less-experienced and lower-skilled agents (up to ~30-36% RPH for the lowest quintile), while the most skilled/experienced agents see negligible speed gains and small quality declines. | Descriptive | Heterogeneity by skill quintile (Figure III: lowest quintile +0.5 RPH, or 36%) and tenure (Figure IV), each controlling for the other dimension; experience-curve plot (Figure V) showing treated novices reach veteran productivity in ~2 months; mean-reversion check (Online Appendix Figure A.VII). | Strong | Minor | Skill is measured by a pre-period performance index, so 'low skill' partly reflects transient low performance; the mean-reversion check is graphical/tercile-based rather than a formal test, leaving some residual regression-to-the-mean concern for the magnitude (not the sign). |
| Productivity gains partly reflect durable worker learning rather than mere reliance on the AI, evidenced by agents continuing to work faster during AI outages. | Causal | Chat-level event studies restricted to software-outage windows (Figure VII): exposed agents handle chats 15-25% faster than their pre-AI baseline even when AI is unavailable, with the effect growing with months of prior exposure and concentrated among high-adherence agents. | Moderate | Minor | Outages are rare, the estimates are noisy, and chats handled during outages may differ in composition from non-outage chats (the authors acknowledge this); 'learning' is also confounded with selection into adherence. The conclusion's framing that gains in part reflect durable worker learning is reasonable but rests on the lowest-powered evidence in the paper. |
| AI assistance improved agents' written English fluency/comprehensibility (especially for Philippines-based agents) and caused low-skill agents' writing to converge toward high-skill agents' writing. | Descriptive | Gemini-scored comprehensibility and 'native fluency' (1-5 scales) event studies (Figure IX); cosine-similarity/textual-embedding convergence analysis (Online Appendix Figure A.XVI) showing high-low skill similarity rising 0.55->0.61. | Weak | Minor | Outcomes are generated by other LLMs (Gemini) and embedding models whose scoring biases are not independently validatable; 'native fluency' as an outcome (defined via the Interagency Language Roundtable 'functionally native' standard) is normatively loaded; convergence may partly reflect changing chat topics rather than worker style. These are measurement-dependent, not behaviorally clean, outcomes. |
| AI assistance improved the experience of work: customer sentiment rose ~0.5 SD, requests to speak to a manager fell ~25%, and worker attrition fell (~40% among agents with <6 months tenure). | Causal | SiEBERT sentiment DiD (Table IV: customer sentiment +0.177, p<0.01, equivalent to about half a standard deviation; manager requests -0.00875, p<0.01, ~25% off a ~6% baseline); attrition analysis (Online Appendix Figure A.XVIII: ~10pp = 40% off a 25% baseline for <6-month agents). | Moderate | Minor | Sentiment is an LLM-derived proxy (SiEBERT, a RoBERTa checkpoint), not a validated customer outcome; the attrition analysis cannot include agent fixed effects because attrition happens once per worker, so the authors explicitly warn it should be taken with more caution than the main productivity results. |
Per-claim assessment
C1. Access to the GPT-3-based AI assistant increased worker productivity, measured as resolutions per hour, by 15% on average (0.30 resolutions/hour off a pretreatment mean of ~1.97).
The flagship effect is well identified for this setting. The estimate is stable as fixed effects are added (falling from 23.9% to 15.2%), survives multiple modern staggered-adoption estimators and an IV that addresses manager selection, and shows an immediate post-adoption jump with flat pre-trends. The 15% headline is the conservative, fully-controlled number, not the raw gap.
C2. Gains accrue disproportionately to less-experienced and lower-skilled agents (up to ~30-36% RPH for the lowest quintile), while the most skilled/experienced agents see negligible speed gains and small quality declines.
The monotone pattern of gains skewing toward novices is internally consistent across five outcomes and two cross-cutting dimensions, and the authors directly test and largely rule out mechanical mean reversion. This is the paper's most novel and robust contrast with the prior skill-biased-technical-change literature.
C3. Productivity gains partly reflect durable worker learning rather than mere reliance on the AI, evidenced by agents continuing to work faster during AI outages.
A clever natural experiment, and the directional pattern (growing with exposure, concentrated in adherers) is consistent with learning. The authors explicitly flag it as noisy and note that outages are rare.
C4. AI assistance improved agents' written English fluency/comprehensibility (especially for Philippines-based agents) and caused low-skill agents' writing to converge toward high-skill agents' writing.
Directionally plausible and consistent with the tacit-knowledge-diffusion story. The authors themselves label the convergence analysis 'only suggestive' and caution it can reflect customer-driven topic shifts.
C5. AI assistance improved the experience of work: customer sentiment rose ~0.5 SD, requests to speak to a manager fell ~25%, and worker attrition fell (~40% among agents with <6 months tenure).
Sentiment and escalation effects are precisely estimated and align with the productivity story. The authors appropriately caution that the attrition result is weaker.
Scorecard
Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.
What the paper does
The paper studies the staggered rollout of a GPT-3-based conversational assistant to 5,172 customer-support agents at a Fortune 500 firm that sells business-process software to small and medium U.S. businesses. The tool monitors live chats and offers real-time response suggestions; agents may ignore it. Identification leans on individual-level differences in adoption timing (the rollout ran mainly from fall 2020 to winter 2021), with a small randomized pilot (~August 2020, ~50 workers) as supporting evidence.
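To make the identification idea concrete, the following is a minimal sketch of a two-way fixed-effects difference-in-differences estimate on a synthetic staggered rollout. Everything in it is an illustrative assumption rather than the authors' code or data: the function names (`double_demean`, `twfe_coefficient`, `simulate_rollout`), the adoption months, and the simulated effect of 0.30 are hypothetical. It relies on the standard result that, in a balanced panel, removing agent and month fixed effects is equivalent to subtracting agent means and month means and adding back the grand mean.

```python
import random


def double_demean(x):
    """Remove unit and period means from a balanced panel x[unit][period]."""
    n, t = len(x), len(x[0])
    unit_means = [sum(row) / t for row in x]
    period_means = [sum(x[i][j] for i in range(n)) / n for j in range(t)]
    grand_mean = sum(unit_means) / n
    return [
        [x[i][j] - unit_means[i] - period_means[j] + grand_mean for j in range(t)]
        for i in range(n)
    ]


def twfe_coefficient(y, d):
    """OLS slope of y on d after absorbing unit and period fixed effects."""
    y_res, d_res = double_demean(y), double_demean(d)
    num = sum(a * b for ry, rd in zip(y_res, d_res) for a, b in zip(ry, rd))
    den = sum(b * b for rd in d_res for b in rd)
    return num / den


def simulate_rollout(n_agents=200, n_months=12, effect=0.30, noise_sd=0.0, seed=0):
    """Synthetic panel: agent effect + month effect + constant treatment effect.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    month_effects = [0.02 * m for m in range(n_months)]
    y, d = [], []
    for _ in range(n_agents):
        agent_effect = rng.gauss(2.0, 0.5)
        adoption_month = rng.choice([4, 6, 8, None])  # None = never treated
        d_row = [
            1.0 if adoption_month is not None and m >= adoption_month else 0.0
            for m in range(n_months)
        ]
        y_row = [
            agent_effect + month_effects[m] + effect * d_row[m] + rng.gauss(0.0, noise_sd)
            for m in range(n_months)
        ]
        y.append(y_row)
        d.append(d_row)
    return y, d


y, d = simulate_rollout(noise_sd=0.5)
print(f"Estimated effect on noisy synthetic data: {twfe_coefficient(y, d):.3f}")
```

The sketch recovers the simulated effect because that effect is constant across agents and months. When effects differ by adoption cohort or grow over time, the plain two-way fixed-effects coefficient can be a misleading average, which is why robustness to estimators designed for staggered adoption matters for the paper's headline number.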
The headline productivity result
Access to AI raises resolutions per hour by 0.30 (15.2%) off a pretreatment mean of 1.97 (Table II, col. 3). The estimate falls from 23.9% to 15.2% as agent, year-month, location and tenure fixed effects are added, and survives Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna and Borusyak et al. estimators plus an IV using team/office first-adoption dates. This is well identified for the setting.
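The arithmetic behind the headline can be checked directly from the figures this Comment reports. The sketch below is illustrative only: the helper names are hypothetical, the interval uses a plain normal approximation with the standard error quoted in the evidence map (0.0329), and it treats the pretreatment mean as a fixed number rather than an estimate, which the paper's own inference need not do.

```python
def relative_effect(delta, baseline):
    """Express an absolute change as a percentage of a baseline level."""
    return 100.0 * delta / baseline


def normal_ci(estimate, std_error, z=1.96):
    """Two-sided interval under a normal approximation (z=1.96 for roughly 95%)."""
    return estimate - z * std_error, estimate + z * std_error


BASELINE_RPH = 1.97  # pretreatment mean, resolutions per hour

headline_pct = relative_effect(0.30, BASELINE_RPH)
ci_low, ci_high = normal_ci(0.301, 0.0329)
ci_pct = (relative_effect(ci_low, BASELINE_RPH), relative_effect(ci_high, BASELINE_RPH))

print(f"Headline effect: {headline_pct:.1f}% of baseline")
print(f"Approximate 95% interval: {ci_pct[0]:.1f}% to {ci_pct[1]:.1f}% of baseline")
```

Under these assumptions the point estimate corresponds to about 15.2% of baseline, and the approximate interval runs from roughly 12% to 19%, so the headline is precisely estimated relative to its size.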
Who benefits
Gains concentrate among novice and low-skill agents: the lowest skill quintile gains +0.5 RPH (36%), while the most skilled see no significant speed gain and small quality declines. Treated agents reach veteran productivity in roughly two months. The authors test and largely rule out mechanical mean reversion. This skill-leveling pattern is the paper's most novel contribution against the skill-biased-technical-change literature.
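The mean-reversion concern can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis: the function name `quintile_changes` and all parameter values are hypothetical. Agents have a fixed true skill, observed performance is skill plus independent noise in each period, and there is no treatment at all. Sorting agents into quintiles by noisy pre-period performance still produces apparent gains at the bottom and apparent losses at the top.

```python
import random
from statistics import mean


def quintile_changes(n_agents=5000, skill_sd=0.4, noise_sd=0.3, seed=1):
    """Mean post-minus-pre change by pre-period quintile, with zero true effect."""
    rng = random.Random(seed)
    agents = []
    for _ in range(n_agents):
        skill = rng.gauss(2.0, skill_sd)            # fixed true skill
        pre = skill + rng.gauss(0.0, noise_sd)      # noisy pre-period measure
        post = skill + rng.gauss(0.0, noise_sd)     # noisy post-period measure
        agents.append((pre, post))
    agents.sort(key=lambda a: a[0])                 # rank by pre-period performance
    size = n_agents // 5
    return [
        mean(post - pre for pre, post in agents[q * size:(q + 1) * size])
        for q in range(5)
    ]


changes = quintile_changes()
for q, change in enumerate(changes, start=1):
    print(f"Pre-period quintile {q}: mean change {change:+.3f}")
```

Because this mechanical pattern has the same shape as the reported heterogeneity, a check that separates it from a genuine treatment effect is needed before reading the magnitude of the bottom-quintile gain at face value. The sign of the paper's result is less exposed to this concern than its size.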
Mechanisms: adherence, learning, and rare problems
Higher adherence predicts larger gains, and adherence rises over time. Using rare software outages, exposed agents still work 15-25% faster than their pre-AI baseline, which the authors read as durable learning while flagging the estimates as noisy. Gains are largest for moderately rare problems, where humans have less baseline experience but the system still has adequate training data.
Language and convergence effects
Gemini-scored comprehensibility and 'native fluency' rise, more so for Philippines-based agents, and textual cosine similarity between high- and low-skill agents climbs from 0.55 to 0.61. The authors explicitly label the convergence analysis 'only suggestive,' noting it can reflect customer-driven topic shifts. These outcomes are LLM-generated and measurement-dependent.
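For readers unfamiliar with the convergence metric, the following is a minimal sketch of cosine similarity between group-average text embeddings. It is not the authors' pipeline: the three-dimensional vectors are made-up toy values, the function names are hypothetical, and real embeddings have hundreds of dimensions and come from a specific embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embeddings (hypothetical values) for messages written by each group.
high_skill = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]]
low_skill_before = [[0.1, 1.0, 0.3], [0.2, 0.9, 0.4]]
low_skill_after = [[0.6, 0.6, 0.2], [0.7, 0.5, 0.2]]

sim_before = cosine_similarity(centroid(high_skill), centroid(low_skill_before))
sim_after = cosine_similarity(centroid(high_skill), centroid(low_skill_after))
print(f"Similarity before: {sim_before:.2f}, after: {sim_after:.2f}")
```

The metric rises whenever the two groups' average embeddings point in more similar directions, whatever the cause. A shift in the topics customers raise would move it just as a change in writing style would, which is why the convergence result is best read as suggestive.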
Experience of work
Customer sentiment rises 0.177 points (~0.5 SD; SiEBERT), requests to speak to a manager fall ~25% off a ~6% baseline, and attrition falls ~40% off a 25% baseline among agents with under six months tenure. The attrition analysis omits agent fixed effects (attrition occurs once per worker), and the authors flag it as weaker than the main results.
Limits and external validity
The setting is one tool, one firm, one job. The data firm, the AI vendor, and individual pay are undisclosed; the period is an early (2020-2021) GPT-3 system. Replication code is posted on the Harvard Dataverse, but the raw chat data are not shared, so the analysis cannot be independently re-run on the underlying data. The randomized pilot is small and lacks control-group data.
Overall appraisal
A careful, important, and appropriately hedged study. The central productivity finding and the novice-favoring heterogeneity are strongly supported; the learning, fluency/convergence, and sentiment mechanisms are more speculative and rest on LLM-derived or low-powered evidence, as the authors themselves acknowledge. The main residual concerns are non-random rollout timing and the inability of outsiders to audit the proprietary, non-shared underlying data.
Strongest critique
The most consequential causal claims beyond the headline (durable 'learning' from outages, and the customer/communication effects) depend either on very low-powered natural-experiment variation (rare, noisy outages whose chat composition may differ) or on outcomes manufactured by other AI models (Gemini fluency scores, SiEBERT sentiment, embedding-based convergence) whose biases cannot be independently validated. Combined with manager-determined, non-random rollout timing and underlying data that outsiders cannot access, this leaves several of the paper's most cited secondary findings less robust than their prominence in policy debate implies.
Strongest fair defence
The headline 15% productivity effect is conservatively estimated, stable as controls are added, robust across five modern staggered-adoption estimators and an IV addressing manager selection, and shows an immediate jump with flat pre-trends. The authors are unusually explicit about every limitation: they label the convergence analysis 'only suggestive,' flag the outage estimates as noisy, and caution that the attrition result is weaker. The paper's claims are therefore calibrated to its evidence rather than overclaimed.
Conclusion
A genuinely important, well-executed empirical study whose central productivity and heterogeneity findings are well supported and whose secondary mechanisms are appropriately hedged by the authors. The principal caveats are non-random rollout timing, reliance on LLM-derived secondary outcomes, and proprietary underlying data that cannot be independently audited. Severity moderate; publish.
Reply from the authors
Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.
Reply: not yet invited. No reply has been received for publication.
The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.
Editorial action after reply: this Comment is part of the founding pilot, so the authors will be invited to reply once the standing board is ratified. This critique addresses claims and methods only, never the authors.
References
Every external source this Comment cites, each with a verified link. 0 fabricated.
Source-grounding attestation
- ✓ Passes the publication validator: no errors
- ✓ Zero fabricated citations: 0 fabricated
- ✓ Severity within the access-basis cap: severity "moderate" ≤ cap "high" for open_access
- ✓ External citations verified: 6/6 checked sources verified
Read at full text; grounding rests on the in-app citation-verification ledger (each external source URL checked).
Independent faithfulness review
A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.
Two adversarial refuters independently retrieved the real source through multiple channels — including the full published QJE PDF (the version this critique was written from), the NBER working paper, and arXiv v2 — and pressed every load-bearing claim for overreach (refuter 1) and mischaracterization (refuter 2). Neither sustained a misreading. Every headline magnitude checks out against the published text: the conservative fully-controlled 15% (0.301 off a 1.97 pretreatment mean), the 30-36% low-skill band with negligible/slightly-negative effects for top performers, the AI-outage durable-learning evidence, the 0.55->0.61 writing convergence, and the customer-experience results (+0.177 / ~0.5 SD sentiment, ~25% fewer manager requests, ~40% lower attrition off a 25% baseline). Crucially, the critique consistently rates support AT or BELOW the paper's own evidence (C3 "moderate," C4 "weak"), preserves the authors' qualifiers ("suggestive," "noisy," "more caution"), and foregrounds the exact limitations the authors themselves concede — non-random manager-determined rollout, the ~50-worker pilot with no control group, and attrition that cannot carry agent fixed effects. The only discrepancy either refuter surfaced — Gemini-scored fluency and ILR "native fluency" framing not appearing in the working paper — is version drift confirmed present in the published QJE source, not a misread, and the critique's underlying measurement-skepticism point holds in both versions. Verdict: faithful.
Version & correction history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-06-15 | Initial publication. |
No silent substantive corrections — every change is versioned and visible.
How to cite this Comment
Critical AI. Comment on “Generative AI at Work” (Erik Brynjolfsson et al., The Quarterly Journal of Economics, 2025). Critical AI; 2026. https://policywindow.org/critique/c/brynjolfsson-li-raymond-generative-ai-at-work-qje-2025
A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.