{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000003","slug":"the-cybernetic-teammate-a-field-experiment-on-gene","url":"https://policywindow.org/critique/c/the-cybernetic-teammate-a-field-experiment-on-gene","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-30","current_version":"2.0","target_paper":{"title":"The Cybernetic Teammate: A Field Experiment on Generative AI and Teamwork","authors":["Fabrizio Dell’Acqua","Charles Ayoubi","Hila Lifshitz‐Assaf","Raffaella Sadun","Ethan Mollick","Lilach Mollick","Yi Han","Jeff Goldman"],"journal":"Organization Science","doi":"10.1287/orsc.2025.20702","url":"https://doi.org/10.1287/orsc.2025.20702","publicationDate":"2026-06-12","paperType":"empirical","accessBasis":"user_supplied","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.1287/orsc.2025.20702"},"source_journal":{"tier":"S","rankingSources":["https://doi.org/10.1287/orsc.2025.20702","https://openalex.org/W7164527973"],"rankingNote":"Organization Science (INFORMS) is a top-tier, FT50 management and organisation-theory journal. 
Tier S."},"selection_provenance":{"id":"the-cybernetic-teammate-a-field-experiment-on-gene","venue":"Organization Science","inMonitoredSet":true,"determinedTier":"S","recordedTier":"S","effectiveTier":"S","kind":"monitored","disclosed":true,"offListPeerReviewed":false},"selection":{"aiAgiCentralityScore":5,"societalRelevanceScore":5,"aiAgiCategories":["labour_markets","human_AI_interaction","innovation_productivity_competition"],"selectionReason":"A widely-discussed preregistered field experiment on generative AI and teamwork in a major firm; its results are already cited in debates about AI replacing collaboration, making the generalisation step worth scrutinising."},"scores":{"aiAgiContribution":5,"evidentiarySupport":3,"methodologicalRisk":3,"overclaiming":3,"reproducibilityOrAuditability":3,"societalImpactRelevance":5,"severity":"moderate","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"This is a strong, preregistered 2×2 field experiment (791 P&G professionals) with a credible design, transparent robustness checks, and unusually candid disclosure of limitations. The full text resolves most concerns the abstract alone raised. The one inferential over-reach that survives adversarial reading is the flagship \"AI matched teams\" claim: it is an equivalence statement supported by separate significance and visual similarity, but the authors never report the one pairwise test (Individual+AI vs Team No AI) or equivalence test that the claim logically requires, and their point estimates actually favor AI (0.373 SD vs 0.245 SD). Secondary, lower-severity issues: the abstract elevates an admittedly exploratory emotion result to a co-equal \"pillar\"; self-reported emotion gains are attributed to \"the actual experience of working with AI\" in a non-blind design vulnerable to novelty/demand effects; and the human-selection-advantage framing rests on an untested 50% vs ~37% gap. 
Calibrated severity: moderate.","claims":[{"id":"C1","text":"The paper's flagship finding is an EQUIVALENCE claim ('AI matched / was comparable to teams without AI'), but it is supported only by each arm being separately significant versus the control and by vi","type":"causal","evidenceOffered":"performed at a level comparable to teams without AI,","support":"weak","overclaiming":"moderate","assessment":"The paper's flagship finding is an EQUIVALENCE claim ('AI matched / was comparable to teams without AI'), but it is supported only by each arm being separately significant versus the control and by visual distribution overlap (Figure 3). No equivalence test (e.g., TOST) is reported, and — tellingly — Table 2 reports the pairwise p-value for Team+AI vs Team No AI (p=0.242) but NEVER reports the Individual+AI vs Team No AI difference test that the 'matched' claim actually requires. This is 'absence of a significant difference' being read as 'evidence of equivalence.' Worse, the point estimates do not support equivalence: Individual+AI = 0.373 SD vs Team No AI = 0.245 SD, i.e., AI is numerically HIGHER, so the data are at least as consistent with 'AI beats teams' as with 'AI equals teams.' The central organizational implication (AI can substitute for a human teammate) is built on an untested null.","mainWeakness":"The paper's flagship finding is an EQUIVALENCE claim ('AI matched / was comparable to teams without AI'), but it is supported only by each arm being separately significant versus the control and by vi","confidence":"high"},{"id":"C2","text":"The abstract presents emotional response as finding (3), a co-equal 'pillar' alongside performance and expertise","type":"descriptive","evidenceOffered":"AI’s language-based interface prompted more positive self-reported emotional","support":"moderate","overclaiming":"minor","assessment":"The abstract presents emotional response as finding (3), a co-equal 'pillar' alongside performance and expertise. 
The body, however, discloses that sociality was preregistered only 'with unclear effects' and that the emotional analysis 'emerged as a significant factor during the study' — i.e., it is an exploratory/emergent result, not a confirmatory preregistered hypothesis. Presenting an exploratory finding with the same evidentiary weight as the two confirmatory preregistered outcomes inflates its standing.","mainWeakness":"The abstract presents emotional response as finding (3), a co-equal 'pillar' alongside performance and expertise","confidence":"medium"},{"id":"C3","text":"The positive-emotion result is self-reported and the AI condition is inherently non-blind (participants knew they had AI, received AI-specific training, and used a novel tool)","type":"causal","evidenceOffered":"shifts attributable to the actual experience of working","support":"weak","overclaiming":"moderate","assessment":"The positive-emotion result is self-reported and the AI condition is inherently non-blind (participants knew they had AI, received AI-specific training, and used a novel tool). The paper attributes the observed positive shift to 'the actual experience of working with AI,' arguing that post-assignment baselines net out initial reactions to assignment. But a post-assignment baseline does not control for demand effects or the novelty/excitement of using a new AI tool DURING the task — exactly the period the emotion change captures. 
Self-reported affect toward a salient new technology in a non-blind upskilling program is a textbook setting for demand and novelty confounds, which the paper does not address for this outcome.","mainWeakness":"The positive-emotion result is self-reported and the AI condition is inherently non-blind (participants knew they had AI, received AI-specific training, and used a novel tool)","confidence":"high"},{"id":"C4","text":"The 'humans retain a selection advantage' narrative rests on a 50% (human teams) vs ~37% (AI) probability-of-picking-best-idea gap from Figure 8 panel (b), but no significance test or confidence inter","type":"descriptive","evidenceOffered":"demonstrate the strongest capability at identifying","support":"moderate","overclaiming":"minor","assessment":"The 'humans retain a selection advantage' narrative rests on a 50% (human teams) vs ~37% (AI) probability-of-picking-best-idea gap from Figure 8 panel (b), but no significance test or confidence interval for this specific difference is reported, and the relevant pairwise comparisons across these conditions are small-sample. The paper itself notes elsewhere (endnote 24) that incremental cross-condition differences 'are not statistically significant,' so presenting selection superiority as a robust human-vs-AI contrast outruns the reported evidence.","mainWeakness":"The 'humans retain a selection advantage' narrative rests on a 50% (human teams) vs ~37% (AI) probability-of-picking-best-idea gap from Figure 8 panel (b), but no significance test or confidence inter","confidence":"medium"}],"sections":[{"id":"what","title":"What the paper does","body":"A preregistered (AEARCTR-0013603) 2×2 field experiment at Procter & Gamble (791 professionals) randomising AI access and team vs solo work on real product-development tasks, with blinded expert evaluation. 
This critique is grounded in the full published text (Organization Science), provided to the editor as a licensed PDF."},{"id":"flaw1","title":"Strongest critique — the equivalence claim","body":"The paper's flagship finding is an EQUIVALENCE claim ('AI matched / was comparable to teams without AI'), but it is supported only by each arm being separately significant versus the control and by visual distribution overlap (Figure 3). No equivalence test (e.g., TOST) is reported, and — tellingly — Table 2 reports the pairwise p-value for Team+AI vs Team No AI (p=0.242) but NEVER reports the Individual+AI vs Team No AI difference test that the 'matched' claim actually requires. This is 'absence of a significant difference' being read as 'evidence of equivalence.' Worse, the point estimates do not support equivalence: Individual+AI = 0.373 SD vs Team No AI = 0.245 SD, i.e., AI is numerically HIGHER, so the data are at least as consistent with 'AI beats teams' as with 'AI equals teams.' The central organizational implication (AI can substitute for a human teammate) is built on an untested null."},{"id":"flaw2","title":"Exploratory result framed as a co-equal pillar","body":"The abstract presents emotional response as finding (3), a co-equal 'pillar' alongside performance and expertise. The body, however, discloses that sociality was preregistered only 'with unclear effects' and that the emotional analysis 'emerged as a significant factor during the study' — i.e., it is an exploratory/emergent result, not a confirmatory preregistered hypothesis. Presenting an exploratory finding with the same evidentiary weight as the two confirmatory preregistered outcomes inflates its standing."},{"id":"flaw3","title":"Self-reported emotion in a non-blind design","body":"The positive-emotion result is self-reported and the AI condition is inherently non-blind (participants knew they had AI, received AI-specific training, and used a novel tool). 
The paper attributes the observed positive shift to 'the actual experience of working with AI,' arguing that post-assignment baselines net out initial reactions to assignment. But a post-assignment baseline does not control for demand effects or the novelty/excitement of using a new AI tool DURING the task — exactly the period the emotion change captures. Self-reported affect toward a salient new technology in a non-blind upskilling program is a textbook setting for demand and novelty confounds, which the paper does not address for this outcome."},{"id":"flaw4","title":"Untested selection-advantage gap","body":"The 'humans retain a selection advantage' narrative rests on a 50% (human teams) vs ~37% (AI) probability-of-picking-best-idea gap from Figure 8 panel (b), but no significance test or confidence interval for this specific difference is reported, and the relevant pairwise comparisons across these conditions are small-sample. The paper itself notes elsewhere (endnote 24) that incremental cross-condition differences 'are not statistically significant,' so presenting selection superiority as a robust human-vs-AI contrast outruns the reported evidence."},{"id":"strengths","title":"What the paper does well","body":"This is a genuinely high-quality study and most abstract-era worries dissolve on full reading. It is preregistered (AEARCTR-0013603) with the performance and expertise hypotheses clearly designated primary and the emotion pillar honestly flagged as exploratory. The 2×2 design is embedded in real P&G product-development work with real stakes, blinded expert evaluators, multiple evaluations per solution, stratified randomization, and pre-treatment balance (Table 1). Robustness is unusually thorough: wild-cluster-bootstrap SEs at the eight randomization units (Table A3), AI-generated re-evaluations, alternative team-aggregation specifications, typo controls, and a composite-index replication. 
The authors repeatedly and explicitly hedge the very claims most open to attack — they call the 40/60 augment-vs-substitute split 'indicative rather than causal,' concede selection differences are not statistically significant (endnote 24), warn emotion results are 'immediate reactions' not long-term dynamics, and lay out scope conditions (single firm, one-day virtual flash teams, single AI model). The equivalence-claim critique is real but narrow: the distributions in Figure 3 are visibly similar and both arms clearly beat control, so the substantive story is directionally sound even if the 'matched' phrasing overstates the statistical warrant."}],"strongest_critique":"The paper's single most prominent claim — that AI-enabled individuals \"matched\" / performed \"comparable to\" two-person teams without AI — is an equivalence assertion unsupported by any equivalence test and not even backed by the relevant pairwise difference test. Table 2 reports the Team+AI vs Team No AI p-value (p=0.242) yet never reports the Individual+AI vs Team No AI comparison the headline depends on; meanwhile the point estimates (Individual+AI 0.373 SD vs Team No AI 0.245 SD) actually run in AI's favor, so the data are at least as consistent with \"AI exceeds teams\" as with \"AI equals teams.\" Inferring equivalence from separate-significance-plus-overlapping-distributions is the classic \"absence of evidence as evidence of absence\" error, and it is load-bearing for the paper's central organizational implication that AI can substitute for a human teammate.","strongest_fair_defence":"This is a genuinely high-quality study and most abstract-era worries dissolve on full reading. It is preregistered (AEARCTR-0013603) with the performance and expertise hypotheses clearly designated primary and the emotion pillar honestly flagged as exploratory. 
The 2×2 design is embedded in real P&G product-development work with real stakes, blinded expert evaluators, multiple evaluations per solution, stratified randomization, and pre-treatment balance (Table 1). Robustness is unusually thorough: wild-cluster-bootstrap SEs at the eight randomization units (Table A3), AI-generated re-evaluations, alternative team-aggregation specifications, typo controls, and a composite-index replication. The authors repeatedly and explicitly hedge the very claims most open to attack — they call the 40/60 augment-vs-substitute split 'indicative rather than causal,' concede selection differences are not statistically significant (endnote 24), warn emotion results are 'immediate reactions' not long-term dynamics, and lay out scope conditions (single firm, one-day virtual flash teams, single AI model). The equivalence-claim critique is real but narrow: the distributions in Figure 3 are visibly similar and both arms clearly beat control, so the substantive story is directionally sound even if the 'matched' phrasing overstates the statistical warrant.","final_judgment":"A methodologically strong, transparently reported, preregistered field experiment whose conclusions are mostly well-supported. The one over-reach that survives adversarial full-text refutation is inferential: the flagship 'AI matched teams' claim is an equivalence statement asserted without an equivalence test or the relevant pairwise comparison, and the point estimates do not actually favor equivalence. Three secondary, lower-severity issues (abstract elevation of an exploratory emotion result, demand/novelty exposure of non-blind self-reported affect, and an untested human-selection-advantage gap) are real but largely acknowledged in the body. 
Net: a moderate critique honestly stated — the paper is good, and its main vulnerability is rhetorical precision around equivalence rather than design or data integrity.","review_process":{"aiAgentsUsed":["claim_extraction","ai_agi_relevance","overclaiming","adversarial","author_defence","citation_integrity","legal_risk","plain_language","meta_review"],"reviewRounds":1,"humanEditor":{"name":"","role":"","approvalDate":"2026-06-15","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited","editorialActionAfterResponse":"Authors may reply at any time; replies are published alongside, and a reply flagging a factual error triggers automated re-evaluation and a versioned correction; this critique addresses claims, framing and generalisation only, never the authors."},"versions":[{"version":"1.0","date":"2026-06-15","note":"Initial publication.","changeType":"initial"},{"version":"2.0","date":"2026-06-30","note":"Upgraded from abstract-only to FULL-TEXT grounding (operator-provided licensed Organization Science PDF; accessBasis user_supplied). Re-critiqued against the verbatim full text and re-cleared the hardened convergence gate; the abstract-era external-validity flaws were WITHDRAWN as the full text resolves them, and the surviving critique now targets a span-exact statistical-inference over-reach (the 'AI matched teams' equivalence claim has no equivalence test) plus measurement/framing points the abstract could not reach.","changeType":"revision"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Full-text critique grounded in the operator-provided licensed publisher PDF (Organization Science; accessBasis user_supplied — re-verification requires source access). Every span is an exact substring of the stored full text; the critique cleared the hardened 3-lens convergence gate (stable survives-majority). 
Upgraded from the abstract-only v1.0: abstract-era external-validity flaws were withdrawn where the full text resolves them; the surviving critique targets the flagship equivalence over-reach and self-reported/exploratory framing. Targets claims/methods/inference only, never the authors.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"DOI 10.1287/orsc.2025.20702 (Organization Science) — Crossref-verified title+authors+year","url":"https://doi.org/10.1287/orsc.2025.20702","verified":true},{"label":"Full text used for span verification (licensed publisher PDF, provided to the editor; not redistributable)","url":"https://doi.org/10.1287/orsc.2025.20702","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Licensed full text quoted sparingly under criticism/review; not redistributed (the PDF and extracted text are never committed). Targets claims/methods/inference only."}}}