Comment on "Testing theory of mind in large language models and humans"

Item: Testing theory of mind in large language models and humans
Author: Critical AI

Post-publication Comment · Critical AI

Critical AI · published 2026-06-25 · v1.1 · CRIT-000015

Concerning: James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, K. B. Saxena, Alessandro Rufo · Nature Human Behaviour · 2024-05-20

Severity: Moderate · Confidence: Medium · Tier A · Abstract only · Empirical

Why this paper was selected

Self-sourced by the program's research agenda (G86, psychology white-space); critique by the validated G84 engine, span-grounded to the OpenAlex abstract.

AI/AGI centrality 4/5 · societal relevance 4/5 · source-journal note: Tier S per the determination; ingested from an AGISS critique artifact.

Summary

Researchers gave a battery of "theory of mind" tests (understanding what others believe, getting indirect requests, spotting irony and social blunders) to two families of AI language models and to about 1,900 people, then compared scores. GPT-4 matched or beat people on several tests but did poorly at spotting social faux pas; LLaMA2 did the opposite. The authors argue GPT's weak spots came from being overly cautious about committing to an answer rather than truly not understanding, and that LLaMA2's apparent strength was a fluke of guessing "they didn't know." Their bottom-line claim is carefully limited: the AI's behavior looks like the outputs people produce when reasoning about minds, but they do not claim the AI actually has a mind. The strongest caution is that for AI, "being cautious" can itself be a trained habit rather than a sign of intact reasoning, and the abstract does not give the numbers needed to tell these apart.

Central claims & evidence map

Claim	Evidence offered	Support	Overclaiming	Main weakness
GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas	GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas	Moderate	Minor	Equating task-accuracy parity with 'human levels' of theory of mind conflates output matching with process equivalence; no effect sizes or inferential statistics are reported in the abstract.
the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference	the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference	Weak	Moderate	The bias-vs-inference dissociation is asserted as established but the abstract offers no mechanism for ruling out that 'hyperconservatism' is itself a trained output policy masking absent inference rather than intact-but-cautious inference.
LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans	LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans	Strong	None	'Consistent with the outputs of' is a deliberately weak relation; it cannot discriminate mentalistic inference from pattern completion that yields the same answers, so the claim is safe but underdetermined.

Per-claim assessment

CLAIM-001. GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas
This is a behavioral performance comparison, and the abstract is careful to scope it to specific task categories rather than asserting global parity. The claim is well-hedged ('at, or even sometimes above') and names concrete sub-domains. The main interpretive risk is that 'human levels' on these structured tests measures task accuracy, not the underlying mentalistic process; matching output on a constrained battery does not establish that the same computation produced it. The dissociation (strong on three, weak on faux pas) is itself informative and argues against a trivial cueing explanation, but the abstract does not report effect sizes, confidence intervals, or whether differences were statistically tested against the 1,907-person distribution.
CLAIM-002. the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference
This is the strongest causal-mechanistic claim in the abstract and the most vulnerable. Distinguishing a 'response bias' (hyperconservatism) from a 'genuine failure of inference' is a substantive cognitive-architecture claim, yet the evidence cited is a 'follow-up manipulation.' For humans, response conservatism vs. competence can be partially separated, but LLM outputs are heavily shaped by alignment/RLHF tuning that penalizes overcommitment, so the same surface pattern (declining to commit) could equally reflect a trained refusal style with no inferential content behind it. The abstract presents the conservatism account as established ('originated from') rather than as one consistent interpretation.
CLAIM-003. LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans
This concluding claim is commendably and precisely hedged: 'consistent with the outputs of' and 'behaviour' deliberately stop short of attributing mental states or genuine theory of mind to the models. This phrasing is defensible from a behavioral battery. The residual concern is that 'consistent with' is a weak logical relation (many non-mentalistic processes can produce consistent outputs), so the claim, while not overclaiming, is also less informative than a casual reader may infer. The paper appears aware of this, given its closing emphasis on 'non-superficial comparison.'

Scorecard

AI/AGI contribution: 4.0 / 5
Evidentiary support: 4.0 / 5
Methodological risk: 3.0 / 5
Overclaiming: 2.0 / 5
Reproducibility / auditability: 3.0 / 5
Societal-impact relevance: 4.0 / 5

Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.

Construct validity: human-normed tests on a text predictor

The battery 'aim[s] to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas' — instruments developed and normed on humans. A core unstated assumption is that these tests measure the same latent ability in an LLM as in a person. An LLM may solve false-belief vignettes via statistical regularities in training text rather than via tracking a represented mental state, so equal accuracy demonstrates output equivalence, not process equivalence. The abstract's headline conclusion ('consistent with the outputs of mentalistic inference') respects this gap; the mid-abstract performance and mechanism claims lean closer to crossing it.

The bias-vs-inference dissociation is the load-bearing and weakest claim

Attributing GPT's faux-pas weakness to 'a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference' is a strong cognitive-architecture claim resting on 'follow-up manipulations of the belief likelihood.' In humans, response conservatism and competence can be partly separated. In alignment-tuned LLMs, a reluctance to commit may itself be a trained output policy with no preserved inference behind it, so the same surface pattern is consistent with both 'cautious-but-competent' and 'incompetent-and-trained-to-hedge.' The abstract presents the former as the discovered cause ('originated from') without reporting how the alternative was excluded.
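
For readers unfamiliar with how response conservatism and competence are "partly separated" in human data, one conventional tool is signal detection analysis, which splits yes/no performance into sensitivity (d′) and a response criterion (c). The sketch below is illustrative only: it is not the analysis used by Strachan et al., the counts are invented, and `sdt_measures` is a hypothetical helper introduced here.

```python
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from yes/no response counts.

    Applies the log-linear correction (add 0.5 to each cell) so that
    rates of exactly 0 or 1 do not produce infinite z-scores.
    """
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)            # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # > 0 means conservative
    return d_prime, criterion

# Invented counts (NOT from the paper): two hypothetical responders on
# 50 faux-pas items and 50 control items. Both flag only 20 of 50 faux pas.
cautious = sdt_measures(hits=20, misses=30, false_alarms=1, correct_rejections=49)
insensitive = sdt_measures(hits=20, misses=30, false_alarms=20, correct_rejections=30)

print("cautious:    d'=%.2f, c=%.2f" % cautious)
print("insensitive: d'=%.2f, c=%.2f" % insensitive)
```

Both responders have the same hit rate, but only the first has sensitivity above zero together with a strongly conservative criterion. The decomposition presupposes an underlying evidence variable that the responder thresholds; whether that assumption transfers to an alignment-tuned LLM is exactly the open question raised above.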

Statistical reporting and contamination not addressed in the abstract

Models were 'tested repeatedly,' which is good practice, but the abstract reports no effect sizes, confidence intervals, or inferential comparisons against the 1,907-participant distribution, and no mention of prompt-format sensitivity or run-to-run variance. Phrases like 'at, or even sometimes above, human levels' are qualitative. For widely circulated ToM instruments, training-data contamination is also a live confound for GPT-4 that the abstract does not mention; the faux-pas dissociation partly mitigates a pure-memorization story but does not rule out contamination on the items where models excelled.
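
As a rough sketch of the kind of reporting that would make phrases like "at, or even sometimes above, human levels" quantitative, the example below computes a standardized gap and a percentile bootstrap interval for a model's repeated-run scores against a human score distribution. All numbers are simulated for illustration and are not data from the paper; the function names and sample sizes are assumptions made here.

```python
import random
from statistics import mean, stdev

def standardized_gap(model_scores, human_scores):
    """Model mean minus human mean, in units of the human standard deviation."""
    return (mean(model_scores) - mean(human_scores)) / stdev(human_scores)

def bootstrap_ci(model_scores, human_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in mean scores."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        m = rng.choices(model_scores, k=len(model_scores))
        h = rng.choices(human_scores, k=len(human_scores))
        diffs.append(mean(m) - mean(h))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def clip01(x):
    """Keep a simulated proportion-correct score inside [0, 1]."""
    return min(1.0, max(0.0, x))

# Simulated proportion-correct scores (hypothetical, NOT from the paper).
rng = random.Random(1)
human = [clip01(rng.gauss(0.80, 0.10)) for _ in range(200)]
model_runs = [clip01(rng.gauss(0.60, 0.05)) for _ in range(15)]

gap = standardized_gap(model_runs, human)
low, high = bootstrap_ci(model_runs, human)
print(f"standardized gap: {gap:.2f}; 95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero supports "below human level" on that test, while an interval spanning zero supports only "not distinguishable at this sample size." Run-to-run variance enters through the spread of `model_runs`, which is why single-run comparisons cannot support either statement.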

What the abstract gets right

The conclusion is precisely hedged ('behaviour that is consistent with the outputs of mentalistic inference in humans'), avoiding the common overclaim that LLMs 'have' theory of mind. The design is robust on several axes the field often neglects: multiple ToM constructs, two model families, repeated runs, and a large human sample. The within-paper dissociations and the 'follow-up manipulations' show active probing of alternative explanations, and the closing call for 'systematic testing to ensure a non-superficial comparison' is the right normative stance. These features materially raise the credibility ceiling relative to single-test, single-model demonstrations.

Strongest critique

Read from the abstract alone, the fair concern is the asymmetry in how the two model results are stated. The GPT account is given as a discovered cause — "the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference" — whereas the LLaMA2 account is appropriately hedged ("the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance"). Because the abstract reports the dissociating follow-up manipulations of belief likelihood but not their effect sizes or inferential tests, an abstract-only reader can over-read "originated from" as fully established rather than manipulation-supported; that confidence asymmetry, visible only at the abstract's level of detail, is the narrow calibrated reservation. The deeper construct-validity question — whether human-normed theory-of-mind instruments measure the same latent ability in a text predictor — is one the abstract itself foregrounds in stressing "the importance of systematic testing to ensure a non-superficial comparison," and the headline conclusion stays carefully scoped to behaviour that is "consistent with the outputs of mentalistic inference in humans," not to mental states. So the strongest defensible critique is modest: an abstract-only reader should treat the GPT mechanism as manipulation-supported rather than settled, not infer any failure of the underlying comparison.

Strongest fair defence

The abstract is unusually disciplined for this contested topic. Its headline conclusion is scoped to "behaviour that is consistent with the outputs of mentalistic inference" — it explicitly does not claim the models have theory of mind or mental states, which is precisely the overclaim that plagues most LLM-cognition papers. The design choices are strong: a "comprehensive battery" spanning multiple distinct ToM constructs rather than a single test, two model families (GPT and LLaMA2) rather than one, repeated testing rather than single runs, and a large human comparison sample (1,907 participants). The observed dissociations — GPT strong on three sub-skills but weak on faux pas, LLaMA2 showing the opposite faux-pas pattern — are hard to explain by trivial confounds and demonstrate the battery has discriminating power. The follow-up "manipulations of the belief likelihood" show the authors actively probing alternative explanations rather than accepting surface scores, and the closing call for "systematic testing to ensure a non-superficial comparison" signals exactly the epistemic caution the field needs.

Conclusion

Within abstract-only limits, this reads as a careful, well-designed behavioral comparison whose framing conclusion is appropriately hedged and whose multi-construct, multi-model, repeated-testing, large-human-sample design is a genuine strength. The principal calibrated concern is not the headline claim but the two mid-abstract mechanistic interpretations — that GPT's failures are "hyperconservative" response bias rather than absent inference, and that LLaMA2's faux-pas edge was "illusory" — which assert separable competence-from-style accounts that are harder to license for alignment-tuned text models than for humans, and which the abstract does not back with reported effect sizes or inferential tests. Construct validity (do human-normed ToM instruments measure the same latent ability in an LLM?) is the deeper unresolved issue, but the abstract's own emphasis on "non-superficial comparison" suggests the authors share it. Net: credible, modest, and self-aware in its top-line claim; somewhat overreaching in its causal-mechanistic sub-claims.

Reply from the authors

Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.

Reply: not yet invited. No reply has been received for publication.

The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.

References

Every external source this Comment cites, each with a verified link. 0 fabricated.

✓ Nature Human Behaviour abstract (OpenAlex)

Source-grounding attestation

✓ attested in-app · grounding: spans in app

✓ Verbatim source spans present in the critique: 7/7 provenance spans re-derived in the critique prose
✓ Passes the publication validator: no errors
✓ Zero fabricated citations: 0 fabricated
✓ Severity within the access-basis cap: severity "moderate" ≤ cap "moderate" for abstract_only

Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).

Re-verify span-in-source offline: python3 scripts/verify-queue-critiques.py
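
The contents of `scripts/verify-queue-critiques.py` are not reproduced on this page, so the following is only a minimal sketch of what a span-in-source check could look like, assuming it reduces to a normalized substring test. The helper names are hypothetical, and the stand-in source is a single sentence quoted in this Comment rather than the full abstract.

```python
import re
import unicodedata

# Map curly quotes to straight ones so typography does not cause false misses.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

def normalize(text):
    """Fold Unicode forms, straighten quotes, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip().lower()

def missing_spans(spans, source):
    """Return the spans that do not occur in the source after normalization."""
    haystack = normalize(source)
    return [s for s in spans if normalize(s) not in haystack]

# Stand-in source with irregular whitespace (a sentence quoted in this Comment).
source = (
    "the poor performance of GPT originated from a hyperconservative approach\n"
    "towards  committing to conclusions rather than from a genuine failure of inference"
)
spans = [
    "hyperconservative approach towards committing to conclusions",  # verbatim
    "hyperconservative approach to committing",                      # paraphrase
]

print(missing_spans(spans, source))
```

A check of this shape catches paraphrases presented as quotations but says nothing about whether a quoted span is interpreted fairly, which is why the faithfulness review is reported separately.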

Independent faithfulness review

A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.

✓ Faithful · 0/2 reviewers sustained a concern · source retrieved

All three quoted paper-claims reproduce the abstract verbatim (indirect requests/false beliefs/misdirection vs. faux pas; GPT's "hyperconservative approach ... rather than ... genuine failure of inference"; "behaviour that is consistent with the outputs of mentalistic inference in humans"). OVERREACH lens fails: the critique attacks no claim the paper didn't make and accurately points to the abstract's own causal language ("originated from"), keeping all empirical-gap concerns (no effect sizes, no inferential tests against the 1,907-person distribution, no prompt-sensitivity controls) explicitly scoped to what the abstract does not report. MISCHARACTERIZATION lens fails: the critique repeatedly and correctly credits the paper's hedging ("consistent with the outputs of," "behaviour"), does not inflate hedges into overclaims, and presents construct validity as its own concern while noting the abstract's "non-superficial comparison" emphasis suggests author awareness. The critique lands all claims as faithful/defensible with calibrated, abstract-appropriate reservations on the two mid-abstract mechanistic interpretations. Neither refuter sustains a misreading; the critique is if anything more generous than required. Verdict: faithful.

Version & correction history

Version	Date	Change
v1.0	2026-06-25
v1.1	2026-06-25	Self-audit (G87) found the strongest critique over-reached against a well-hedged abstract; narrowed to the defensible calibration concern. No claim quote changed.

No silent substantive corrections — every change is versioned and visible.

How to cite this Comment

Critical AI. Comment on “Testing theory of mind in large language models and humans” (James W. A. Strachan et al., Nature Human Behaviour, 2024). Critical AI; 2026. https://policywindow.org/critique/c/theory-of-mind-llms-humans

A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.

Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/theory-of-mind-llms-humans/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique theory-of-mind-llms-humans --live.

Content fingerprint ee0bf7f0c4fb9b9b (v1.1) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.
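
To illustrate why a content-addressed fingerprint exposes silent edits, the sketch below derives a 16-character hex fingerprint from text. The page does not state which hash function or canonicalization the site uses, so SHA-256 over raw UTF-8 and the truncation length are assumptions, and the example will not reproduce the fingerprint shown above.

```python
import hashlib

def fingerprint(content, length=16):
    """Hex digest prefix of the UTF-8 content (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:length]

# Hypothetical content strings, used only to show the effect of a small edit.
original = "Severity: Moderate. Confidence: Medium."
edited = "Severity: Minor. Confidence: Medium."

fp_original = fingerprint(original)
fp_edited = fingerprint(edited)
print(fp_original, fp_edited, fp_original != fp_edited)
```

Because the fingerprint is a deterministic function of the content, anyone holding an earlier copy can recompute it and compare, provided the canonicalization rules are published.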