{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000002","slug":"brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","url":"https://policywindow.org/critique/c/brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-15","current_version":"1.0","target_paper":{"title":"Generative AI at Work","authors":["Erik Brynjolfsson","Danielle Li","Lindsey R. Raymond"],"journal":"The Quarterly Journal of Economics","publisher":"Oxford University Press","doi":"10.1093/qje/qjae044","url":"https://academic.oup.com/qje/article/140/2/889/7990658","publicationDate":"2025-02-04","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.1093/qje/qjae044"},"source_journal":{"tier":"S","rankingSources":["https://ideas.repec.org/top/top.journals.simple10.html","https://en.wikipedia.org/wiki/The_Quarterly_Journal_of_Economics","https://academic.oup.com/qje/article/140/2/889/7990658"],"rankingNote":"The Quarterly Journal of Economics is one of the canonical 'top-5' economics journals and ranks 1st on the RePEc IDEAS simple 10-year impact-factor list (IF 73.019, ahead of the American Economic Review at 52.053). It is a flagship, rigorously refereed general-interest economics journal published by Oxford University Press for the Harvard University Department of Economics (editors named in the paper: Lawrence Katz and Andrei Shleifer). 
Tier S."},"selection_provenance":{"id":"brynjolfsson-li-raymond-generative-ai-at-work-qje-2025","venue":"The Quarterly Journal of Economics","inMonitoredSet":true,"determinedTier":"S","recordedTier":"S","effectiveTier":"S","kind":"monitored","disclosed":true,"offListPeerReviewed":false},"selection":{"aiAgiCentralityScore":5,"societalRelevanceScore":5,"aiAgiCategories":["labour_markets","innovation_productivity_competition","human_AI_interaction","inequality_bias_fairness","knowledge_production"],"selectionReason":"This is among the most cited and policy-influential empirical studies of generative AI's effect on real-world labour productivity. It studies a GPT-3-based assistant deployed to 5,172 customer-support agents in a Fortune 500 firm, directly addressing AI-and-labour-markets, the skill distribution of AI gains (novices benefit most), human-AI interaction (adherence/learning), and knowledge production (tacit-knowledge diffusion and the long-run model-training feedback loop). Its findings are routinely cited in AI policy and labour debates, making post-publication scrutiny high-value."},"scores":{"aiAgiContribution":5,"evidentiarySupport":4,"methodologicalRisk":2.5,"overclaiming":1.5,"reproducibilityOrAuditability":2,"societalImpactRelevance":5,"severity":"moderate","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"This paper asks a question many people now care about: what happens to ordinary workers when a generative-AI tool is dropped into their daily job? The authors study 5,172 customer-support agents at a Fortune 500 software company. Most agents are based in the Philippines and answer technical questions from U.S. small-business owners over chat. The company rolled out an AI assistant built on OpenAI's GPT-3 that watches each live conversation and suggests, in real time, how the agent might reply, plus links to internal help documents. The agent stays in charge and can ignore the suggestions. 
Because the firm could only train and license a limited number of agents at a time, different agents got access in different months, which lets the authors compare the same workers before and after they got the tool, and against workers who never got it.\n\nThe headline result is a 15% average increase in productivity, measured as customer problems resolved per hour. The more striking result is who benefits. The gains go overwhelmingly to the least experienced and lowest-skilled agents, who improve by roughly 30-36%, while the most experienced, highest-skilled agents barely change, and on a couple of quality measures get slightly worse. New agents with the tool reach the productivity level that previously took several months of experience in about two months. This is notable because earlier waves of computing tended to help skilled workers most; here the pattern flips.\n\nThe authors push further into why. Agents who follow the AI's suggestions more closely gain more, and skeptical agents gradually start trusting it. Using accidental software outages, when the AI suddenly goes dark, they show that agents who had used the tool for a while still work faster than their pre-AI selves, which they read as genuine learning rather than mere dependence. The biggest gains appear not on the most common problems (which even novices already know) nor on the rarest (where the AI lacks training data), but on moderately uncommon problems. Using other AI models, they also find agents write more fluent, more 'native-sounding' English after adoption, especially overseas agents, and that low-skill agents' writing drifts to resemble high-skill agents' writing. Finally, customers behave better: their messages are warmer and they ask for a manager about 25% less often, and worker turnover falls, mostly among newer agents.\n\nHow much should we trust this? The study has real strengths. 
It is a large, granular, real-workplace dataset rather than a lab experiment; the productivity effect appears immediately and persists; and the authors stress-test it with several modern statistical estimators, an instrumental-variable approach, and a small embedded randomized pilot, all of which point the same way. They are unusually careful about caveats: they repeatedly say this is one tool, one firm, one job, that they cannot see wages or company-wide employment, and that the productivity numbers are short-to-medium-run.\n\nThe main reasons for caution are not errors so much as limits. Who got the AI and when was decided by managers, not by a clean lottery (the embedded pilot covers only about 50 workers and the authors lack data on its control group), so selection cannot be fully ruled out, though the IV and event-study checks reduce the worry. Several of the most interesting mechanisms, the outage-based 'learning' result, the English-fluency and writing-convergence findings, are explicitly described by the authors as suggestive: outages are rare and noisy, and the fluency and sentiment scores are produced by other AI models (Gemini, SiEBERT) whose own biases are hard to audit. Crucially, the underlying chat data are not shared; only replication code is posted on the Harvard Dataverse, so independent researchers cannot re-run the analysis on the raw data. And the firm, the AI vendor, and even individual pay are undisclosed, with the study period being 2020-2021 (an early GPT-3 system), all of which limit external scrutiny and generalization. 
In short: a careful, important, appropriately hedged study whose central productivity finding is well supported, but whose more speculative mechanisms and whose non-shared underlying data warrant the measured tone the authors themselves adopt.","claims":[{"id":"C1","text":"Access to the GPT-3-based AI assistant increased worker productivity, measured as resolutions per hour, by 15% on average (0.30 resolutions/hour off a pretreatment mean of ~1.97).","type":"causal","evidenceOffered":"Two-way fixed-effects difference-in-differences with agent, year-month, location and agent-tenure fixed effects (Table II, col. 3: 0.301, SE 0.0329, p<0.01); immediate, persistent Sun-Abraham event-study (Figure II); robustness across de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna, Borusyak et al. estimators and an IV using team/office first-adoption dates.","support":"strong","overclaiming":"none","assessment":"The flagship effect is well identified for this setting. The estimate is stable as fixed effects are added (falling from 23.9% to 15.2%), survives multiple modern staggered-adoption estimators and an IV that addresses manager selection, and shows an immediate post-adoption jump with flat pre-trends. 
The 15% headline is the conservative, fully controlled number, not the raw gap.","mainWeakness":"Identification rests on non-random, manager-determined rollout timing; parallel-trends and no-anticipation are assumed rather than guaranteed, and the embedded pilot (~50 workers, no control-group data) is too small to anchor the headline on its own.","confidence":"high"},{"id":"C2","text":"Gains accrue disproportionately to less-experienced and lower-skilled agents (up to ~30-36% RPH for the lowest quintile), while the most skilled/experienced agents see negligible speed gains and small quality declines.","type":"descriptive","evidenceOffered":"Heterogeneity by skill quintile (Figure III: lowest quintile +0.5 RPH, or 36%) and tenure (Figure IV), each controlling for the other dimension; experience-curve plot (Figure V) showing treated novices reach veteran productivity in ~2 months; mean-reversion check (Online Appendix Figure A.VII).","support":"strong","overclaiming":"minor","assessment":"The monotone skill-leveling pattern, with gains shrinking as skill and tenure rise, is internally consistent across five outcomes and two cross-cutting dimensions, and the authors directly test and largely rule out mechanical mean reversion. 
This is the paper's most novel and robust contrast with prior skill-biased-technical-change literature.","mainWeakness":"Skill is measured by a pre-period performance index, so 'low skill' partly reflects transient low performance; the mean-reversion check is graphical/tercile-based rather than a formal test, leaving some residual regression-to-the-mean concern for the magnitude (not the sign).","confidence":"high"},{"id":"C3","text":"Productivity gains partly reflect durable worker learning rather than mere reliance on the AI, evidenced by agents continuing to work faster during AI outages.","type":"causal","evidenceOffered":"Chat-level event studies restricting to software-outage windows (Figure VII): exposed agents handle chats 15-25% faster than their pre-AI baseline even when AI is unavailable, with the effect growing with months of prior exposure and concentrated among high-adherence agents.","support":"moderate","overclaiming":"minor","assessment":"A clever natural experiment, and the directional pattern (growing with exposure, concentrated in adherers) is consistent with learning. The authors explicitly flag it as noisy and note that outages are rare.","mainWeakness":"Outages are rare, the estimates are noisy, and chats handled during outages may differ in composition from non-outage chats (the authors acknowledge this); 'learning' is also confounded with selection into adherence. 
The conclusion's framing that gains in part reflect durable worker learning is reasonable but rests on the weakest-powered evidence in the paper.","confidence":"medium"},{"id":"C4","text":"AI assistance improved agents' written English fluency/comprehensibility (especially for Philippines-based agents) and caused low-skill agents' writing to converge toward high-skill agents' writing.","type":"descriptive","evidenceOffered":"Gemini-scored comprehensibility and 'native fluency' (1-5 scales) event studies (Figure IX); cosine-similarity/textual-embedding convergence analysis (Online Appendix Figure A.XVI) showing high-low skill similarity rising 0.55->0.61.","support":"weak","overclaiming":"minor","assessment":"Directionally plausible and consistent with the tacit-knowledge-diffusion story. The authors themselves label the convergence analysis 'only suggestive' and caution it can reflect customer-driven topic shifts.","mainWeakness":"Outcomes are generated by other LLMs (Gemini) and embedding models whose scoring biases are not independently validatable; 'native fluency' as an outcome (defined via the Interagency Language Roundtable 'functionally native' standard) is normatively loaded; convergence may partly reflect changing chat topics rather than worker style. 
These are measurement-dependent, not behaviorally clean, outcomes.","confidence":"medium"},{"id":"C5","text":"AI assistance improved the experience of work: customer sentiment rose ~0.5 SD, requests to speak to a manager fell ~25%, and worker attrition fell (~40% among agents with <6 months tenure).","type":"causal","evidenceOffered":"SiEBERT sentiment DiD (Table IV: customer sentiment +0.177, p<0.01, equivalent to about half a standard deviation; manager requests -0.00875, p<0.01, ~25% off a ~6% baseline); attrition analysis (Online Appendix Figure A.XVIII: ~10pp = 40% off a 25% baseline for <6-month agents).","support":"moderate","overclaiming":"minor","assessment":"Sentiment and escalation effects are precisely estimated and align with the productivity story. The authors appropriately caution that the attrition result is weaker.","mainWeakness":"Sentiment is an LLM-derived proxy (SiEBERT, a RoBERTa checkpoint), not a validated customer outcome; the attrition analysis cannot include agent fixed effects because attrition happens once per worker, so the authors explicitly warn it should be taken with more caution than the main productivity results.","confidence":"medium"}],"sections":[{"id":"overview","title":"What the paper does","body":"The paper studies the staggered rollout of a GPT-3-based conversational assistant to 5,172 customer-support agents at a Fortune 500 firm that sells business-process software to small and medium U.S. businesses. The tool monitors live chats and offers real-time response suggestions; agents may ignore it. Identification leans on individual-level differences in adoption timing (rollout primarily fall 2020-winter 2021), with a small randomized pilot (~August 2020, ~50 workers) as supporting evidence."},{"id":"headline","title":"The headline productivity result","body":"Access to AI raises resolutions per hour by 0.30 (15.2%) off a pretreatment mean of 1.97 (Table II, col. 3). 
The estimate falls from 23.9% to 15.2% as agent, year-month, location and tenure fixed effects are added, and survives Sun-Abraham, de Chaisemartin-D'Haultfoeuille, Callaway-Sant'Anna and Borusyak et al. estimators plus an IV using team/office first-adoption dates. The effect is well identified for this setting."},{"id":"heterogeneity","title":"Who benefits","body":"Gains concentrate among novice and low-skill agents: the lowest skill quintile gains +0.5 RPH (36%), while the most skilled see no significant speed gain and small quality declines. Treated novices reach veteran productivity in roughly two months. The authors test and largely rule out mechanical mean reversion. This skill-leveling pattern is the paper's most novel contrast with the skill-biased-technical-change literature."},{"id":"mechanisms","title":"Mechanisms: adherence, learning, and rare problems","body":"Higher adherence predicts larger gains, and adherence rises over time. During rare software outages, exposed agents still work 15-25% faster than their pre-AI baseline, which the authors read as durable learning while flagging the estimates as noisy. Gains are largest for moderately rare problems, where humans have less baseline experience but the system still has adequate training data."},{"id":"communication","title":"Language and convergence effects","body":"Gemini-scored comprehensibility and 'native fluency' rise, more so for Philippines-based agents, and textual cosine similarity between high- and low-skill agents climbs from 0.55 to 0.61. The authors explicitly label the convergence analysis 'only suggestive,' noting it can reflect customer-driven topic shifts. These outcomes are LLM-generated and measurement-dependent."},{"id":"experience","title":"Experience of work","body":"Customer sentiment rises 0.177 points (~0.5 SD; SiEBERT), requests to speak to a manager fall ~25% off a ~6% baseline, and attrition falls ~40% off a 25% baseline among agents with under six months' tenure. 
The attrition analysis omits agent fixed effects (attrition occurs once per worker), and the authors flag it as weaker than the main results."},{"id":"limits","title":"Limits and external validity","body":"The setting is one tool, one firm, one job. The data firm, the AI vendor, and individual pay are undisclosed; the study period is 2020-2021, when the tool was an early GPT-3 system. Replication code is posted on the Harvard Dataverse, but the raw chat data are not shared, so the analysis cannot be independently re-run on the underlying data. The randomized pilot is small and lacks control-group data."},{"id":"verdict-section","title":"Overall appraisal","body":"A careful, important, and appropriately hedged study. The central productivity finding and the skill-leveling heterogeneity are strongly supported; the learning, fluency/convergence, and sentiment mechanisms are more speculative and rest on LLM-derived or low-powered evidence, as the authors themselves acknowledge. The main residual concerns are non-random rollout timing and the inability of outsiders to audit the proprietary, non-shared underlying data."}],"strongest_critique":"The most consequential causal claims beyond the headline -- durable 'learning' from outages, and the customer/communication effects -- depend on either very low-powered natural-experiment variation (rare, noisy outages whose chat composition may differ) or on outcomes manufactured by other AI models (Gemini fluency scores, SiEBERT sentiment, embedding-based convergence) whose biases cannot be independently validated; combined with manager-determined, non-random rollout timing and underlying data that outsiders cannot access, several of the paper's most cited secondary findings are less robust than their prominence in policy debate implies.","strongest_fair_defence":"The headline 15% productivity effect is conservatively estimated, stable as controls are added, robust across four modern staggered-adoption estimators and an IV addressing manager 
selection, and shows an immediate jump with flat pre-trends; the authors are unusually explicit about every limitation -- labeling the convergence analysis 'only suggestive,' flagging the outage estimates as noisy, and cautioning that the attrition result is weaker -- so the paper's claims are calibrated to its evidence rather than overclaimed.","final_judgment":"A genuinely important, well-executed empirical study whose central productivity and heterogeneity findings are well supported and whose secondary mechanisms are appropriately hedged by the authors. The principal caveats are non-random rollout timing, reliance on LLM-derived secondary outcomes, and proprietary underlying data that cannot be independently audited. Severity moderate; publish.","review_process":{"aiAgentsUsed":["claim_extraction","ai_agi_relevance","methods","statistics","reproducibility","literature_context","policy_impact","ethics_society","overclaiming","adversarial","author_defence","citation_integrity","legal_risk","plain_language","meta_review"],"reviewRounds":2,"humanEditor":{"name":"","role":"","approvalDate":"2026-06-15","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited","editorialActionAfterResponse":"Authors may reply at any time; replies are published alongside, and a reply flagging a factual error triggers automated re-evaluation and a versioned correction; this critique addresses claims and methods only, never the authors."},"versions":[{"version":"1.0","date":"2026-06-15","note":"Initial publication.","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Drafted and cross-examined by the synthetic-review roster, then verified by an independent adversarial pass that re-fetched the open-access paper and confirmed every bibliographic fact and external citation. Bibliographic record re-confirmed via Crossref. Severity calibrated to the open-access basis. 
Published autonomously on passing the automated integrity gate (no human editor).","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"QJE article landing page (OUP)","url":"https://academic.oup.com/qje/article/140/2/889/7990658","verified":true},{"label":"DOI 10.1093/qje/qjae044","url":"https://doi.org/10.1093/qje/qjae044","verified":true},{"label":"NBER Working Paper No. 31161","url":"https://www.nber.org/papers/w31161","verified":true},{"label":"RePEc IDEAS simple 10-year impact-factor ranking (QJE 1st, IF 73.019)","url":"https://ideas.repec.org/top/top.journals.simple10.html","verified":true},{"label":"Wikipedia: The Quarterly Journal of Economics (top-5, OUP for Harvard)","url":"https://en.wikipedia.org/wiki/The_Quarterly_Journal_of_Economics","verified":true},{"label":"Replication code, Harvard Dataverse (Brynjolfsson, Li, Raymond 2024)","url":"https://doi.org/10.7910/DVN/FSV1X7","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Open-access paper; quoted sparingly under criticism/review. Critique targets claims, methods, identification and policy inference only — never author character or motive."}}}