{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000015","slug":"theory-of-mind-llms-humans","url":"https://policywindow.org/critique/c/theory-of-mind-llms-humans","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-25","current_version":"1.1","target_paper":{"title":"Testing theory of mind in large language models and humans","authors":["James W. A. Strachan","Dalila Albergo","Giulia Borghini","Oriana Pansardi","Eugenio Scaliti","Saurabh Gupta","K. B. Saxena","Alessandro Rufo"],"journal":"Nature Human Behaviour","doi":"10.1038/s41562-024-01882-z","url":"https://doi.org/10.1038/s41562-024-01882-z","publicationDate":"2024-05-20","paperType":"empirical","accessBasis":"abstract_only","fullTextUsed":false,"fictional":false,"doi_url":"https://doi.org/10.1038/s41562-024-01882-z"},"source_journal":{"tier":"A","rankingSources":["resolved from the monitored-venue determination"],"rankingNote":"Tier S per the determination; ingested from an AGISS critique artifact."},"selection_provenance":{"id":"theory-of-mind-llms-humans","venue":"Nature Human Behaviour","inMonitoredSet":true,"determinedTier":"A","recordedTier":"S","effectiveTier":"A","kind":"monitored","disclosed":true},"selection":{"aiAgiCentralityScore":4,"societalRelevanceScore":4,"aiAgiCategories":[],"selectionReason":"Self-sourced by the program's research agenda (G86, psychology white-space); critique by the validated G84 engine, span-grounded to the OpenAlex abstract."},"scores":{"aiAgiContribution":4,"evidentiarySupport":4,"methodologicalRisk":3,"overclaiming":2,"reproducibilityOrAuditability":3,"societalImpactRelevance":4,"severity":"moderate","confidence":"medium"},"severity_cap_for_access_basis":"moderate","plain_language_summary":"Researchers gave a battery of \"theory of mind\" tests (understanding what others believe, getting indirect requests, spotting irony and social blunders) to two families of AI language 
models and to about 1,900 people, then compared scores. GPT-4 matched or beat people on several tests but did poorly at spotting social faux pas; LLaMA2 did the opposite. The authors argue GPT's weak spots came from being overly cautious about committing to an answer rather than truly not understanding, and that LLaMA2's apparent strength was a fluke of guessing \"they didn't know.\" Their bottom-line claim is carefully limited: the AI's behavior looks like the outputs people produce when reasoning about minds, but they do not claim the AI actually has a mind. The strongest caution is that for AI, \"being cautious\" can itself be a trained habit rather than a sign of intact reasoning, and the abstract does not give the numbers needed to tell these apart.","claims":[{"id":"CLAIM-001","text":"GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas","type":"empirical","evidenceOffered":"GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas","support":"moderate","overclaiming":"minor","assessment":"This is a behavioral performance comparison, and the abstract is careful to scope it to specific task categories rather than asserting global parity. The claim is well-hedged ('at, or even sometimes above') and names concrete sub-domains. The main interpretive risk is that 'human levels' on these structured tests measures task accuracy, not the underlying mentalistic process; matching output on a constrained battery does not establish that the same computation produced it. 
The dissociation (strong on three, weak on faux pas) is itself informative and argues against a trivial cueing explanation, but the abstract does not report effect sizes, confidence intervals, or whether differences were statistically tested against the 1,907-person distribution.","mainWeakness":"Equating task-accuracy parity with 'human levels' of theory of mind conflates output matching with process equivalence; no effect sizes or inferential statistics are reported in the abstract.","confidence":"medium"},{"id":"CLAIM-002","text":"the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference","type":"empirical","evidenceOffered":"the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference","support":"weak","overclaiming":"moderate","assessment":"This is the strongest causal-mechanistic claim in the abstract and the most vulnerable. Distinguishing a 'response bias' (hyperconservatism) from a 'genuine failure of inference' is a substantive cognitive-architecture claim, yet the evidence cited is a 'follow-up manipulation.' For humans, response conservatism vs. competence can be partially separated, but LLM outputs are heavily shaped by alignment/RLHF tuning that penalizes overcommitment, so the same surface pattern (declining to commit) could equally reflect a trained refusal style with no inferential content behind it. 
The abstract presents the conservatism account as established ('originated from') rather than as one consistent interpretation.","mainWeakness":"The bias-vs-inference dissociation is asserted as established but the abstract offers no mechanism for ruling out that 'hyperconservatism' is itself a trained output policy masking absent inference rather than intact-but-cautious inference.","confidence":"medium"},{"id":"CLAIM-003","text":"LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans","type":"empirical","evidenceOffered":"LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans","support":"strong","overclaiming":"none","assessment":"This concluding claim is commendably and precisely hedged: 'consistent with the outputs of' and 'behaviour' deliberately stop short of attributing mental states or genuine theory of mind to the models. This phrasing is defensible from a behavioral battery. The residual concern is that 'consistent with' is a weak logical relation (many non-mentalistic processes can produce consistent outputs), so the claim, while not overclaiming, is also less informative than a casual reader may infer. The paper appears aware of this, given its closing emphasis on 'non-superficial comparison.'","mainWeakness":"'Consistent with the outputs of' is a deliberately weak relation; it cannot discriminate mentalistic inference from pattern completion that yields the same answers, so the claim is safe but underdetermined.","confidence":"medium"}],"sections":[{"id":"s1","title":"Construct validity: human-normed tests on a text predictor","body":"The battery 'aim[s] to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas' — instruments developed and normed on humans. A core unstated assumption is that these tests measure the same latent ability in an LLM as in a person. 
An LLM may solve false-belief vignettes via statistical regularities in training text rather than via tracking a represented mental state, so equal accuracy demonstrates output equivalence, not process equivalence. The abstract's headline conclusion ('consistent with the outputs of mentalistic inference') respects this gap; the mid-abstract performance and mechanism claims lean closer to crossing it."},{"id":"s2","title":"The bias-vs-inference dissociation is the load-bearing and weakest claim","body":"Attributing GPT's faux-pas weakness to 'a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference' is a strong cognitive-architecture claim resting on 'follow-up manipulations of the belief likelihood.' In humans, response conservatism and competence can be partly separated. In alignment-tuned LLMs, a reluctance to commit may itself be a trained output policy with no preserved inference behind it, so the same surface pattern is consistent with both 'cautious-but-competent' and 'incompetent-and-trained-to-hedge.' The abstract presents the former as a discovered cause ('originated from') without reporting how the alternative was excluded."},{"id":"s3","title":"Statistical reporting and contamination not addressed in the abstract","body":"Models were 'tested repeatedly,' which is good practice, but the abstract reports no effect sizes, confidence intervals, or inferential comparisons against the 1,907-participant distribution, and makes no mention of prompt-format sensitivity or run-to-run variance. Phrases like 'at, or even sometimes above, human levels' are qualitative. 
For widely circulated ToM instruments, training-data contamination is also a live confound for GPT-4 that the abstract does not mention; the faux-pas dissociation partly mitigates a pure-memorization story but does not rule out contamination on the items where models excelled."},{"id":"s4","title":"What the abstract gets right","body":"The conclusion is precisely hedged ('behaviour that is consistent with the outputs of mentalistic inference in humans'), avoiding the common overclaim that LLMs 'have' theory of mind. The design is robust on several axes the field often neglects: multiple ToM constructs, two model families, repeated runs, and a large human sample. The within-paper dissociations and the 'follow-up manipulations' show active probing of alternative explanations, and the closing call for 'systematic testing to ensure a non-superficial comparison' is the right normative stance. These features materially raise the credibility ceiling relative to single-test, single-model demonstrations."}],"strongest_critique":"Read from the abstract alone, the fair concern is the asymmetry in how the two model results are stated. The GPT account is given as a discovered cause — \"the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference\" — whereas the LLaMA2 account is appropriately hedged (\"the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance\"). Because the abstract reports the dissociating follow-up manipulations of belief likelihood but not their effect sizes or inferential tests, an abstract-only reader can over-read \"originated from\" as fully established rather than manipulation-supported; that confidence asymmetry, visible only at the abstract's level of detail, is the narrow calibrated reservation. 
The deeper construct-validity question — whether human-normed theory-of-mind instruments measure the same latent ability in a text predictor — is one the abstract itself foregrounds in stressing \"the importance of systematic testing to ensure a non-superficial comparison,\" and the headline conclusion stays carefully scoped to behaviour that is \"consistent with the outputs of mentalistic inference in humans,\" not to mental states. So the strongest defensible critique is modest: an abstract-only reader should treat the GPT mechanism as manipulation-supported rather than settled, not infer any failure of the underlying comparison.","strongest_fair_defence":"The abstract is unusually disciplined for this contested topic. Its headline conclusion is scoped to \"behaviour that is consistent with the outputs of mentalistic inference\" — it explicitly does not claim the models have theory of mind or mental states, which is precisely the overclaim that plagues most LLM-cognition papers. The design choices are strong: a \"comprehensive battery\" spanning multiple distinct ToM constructs rather than a single test, two model families (GPT and LLaMA2) rather than one, repeated testing rather than single runs, and a large human comparison sample (1,907 participants). The observed dissociations — GPT strong on three sub-skills but weak on faux pas, LLaMA2 showing the opposite faux-pas pattern — are hard to explain by trivial confounds and demonstrate the battery has discriminating power. 
The follow-up \"manipulations of the belief likelihood\" show the authors actively probing alternative explanations rather than accepting surface scores, and the closing call for \"systematic testing to ensure a non-superficial comparison\" signals exactly the epistemic caution the field needs.","final_judgment":"Within abstract-only limits, this reads as a careful, well-designed behavioral comparison whose framing conclusion is appropriately hedged and whose multi-construct, multi-model, repeated-testing, large-human-sample design is a genuine strength. The principal calibrated concern is not the headline claim but the two mid-abstract mechanistic interpretations — that GPT's failures are \"hyperconservative\" response bias rather than absent inference, and that LLaMA2's faux-pas edge was \"illusory\" — which assert separable competence-from-style accounts that are harder to license for alignment-tuned text models than for humans, and which the abstract does not back with reported effect sizes or inferential tests. Construct validity (do human-normed ToM instruments measure the same latent ability in an LLM?) is the deeper unresolved issue, but the abstract's own emphasis on \"non-superficial comparison\" suggests the authors share it. Net: credible, modest, and self-aware in its top-line claim; somewhat overreaching in its causal-mechanistic sub-claims.","review_process":{"aiAgentsUsed":["AGISS critique engine (validated G84 directive)"],"reviewRounds":1,"humanEditor":{"name":"","role":"","approvalDate":"","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited"},"versions":[{"version":"1.0","date":"2026-06-25","note":"","changeType":"initial"},{"version":"1.1","date":"2026-06-25","note":"Self-audit (G87) found the strongest critique over-reached against a well-hedged abstract; narrowed to the defensible calibration concern. 
No claim quote changed.","changeType":"revision"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Self-sourced by the program's research agenda (G86, psychology white-space); critique by the validated G84 engine, span-grounded to the OpenAlex abstract.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"Nature Human Behaviour abstract (OpenAlex)","url":"https://doi.org/10.1038/s41562-024-01882-z","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Critiques claims and methods only; no author-motive/misconduct language. Abstract-only; severity capped to moderate; fair-use of short abstract spans."}}}