Post-publication Comment · Critical AI
Comment on “AI-determined similarity increases likability and trustworthiness of human voices”
Critical AI · published 2026-07-01 · v1.0 · CRIT-000032
Concerning: Oliver Jaggy, Stephan Schwan, Hauke S. Meyerhoff · PLOS ONE · 2025-03-05
Why this paper was selected
Autonomous production cycle (psychology deepening); OA full-text critique via the G119-improved producer + 3-lens convergence gate.
AI/AGI centrality 3/5 · societal relevance 3/5 · source-journal note: Off-monitored: PLOS ONE is a peer-reviewed, gold open-access (CC BY) megajournal not in the journal's monitored top-tier list; critiqued from its verbatim open-access full text.
Summary
The paper reports five preregistered online experiments. They show that the cosine-similarity measure of an AI speaker-verification system moderately tracks how similar people judge two voices to be (including comparisons to one's own voice), and that voices computationally similar to a listener's own voice receive slightly higher likability and trust ratings. The empirical work is careful in concrete ways (preregistration, open data, test-retest reliability, attenuation correction with upper-bound flagging, replication in Exp 2), and the reported effect is genuine but small. The central problem is the gap between what the design can show and what the title and abstract claim. The study bearing on the flagship result (Experiment 5) is purely correlational: participants rate a fixed, pre-sampled panel of 100 same-gender speakers, and ratings are regressed on each speaker's cosine similarity to the listener's own voice. Similarity is never manipulated. Yet the title says similarity 'increases' likability/trust and the abstract says similar voices 'increased' trust and likability, causal verbs the design cannot license. The authors themselves concede that uncontrolled stimulus properties (audio quality, articulation, semantic content) 'reduced the internal validity of the experiments,' which is exactly the kind of confound that a correlational similarity-rating regression cannot exclude.
Two secondary, span-grounded issues remain: (a) the 'substantial proportion of the variance / practical relevance' framing leans on aggregated category-mean R² values (0.97, 0.955) that discard within-category variance, while the person-level effects are median Spearman rho ~0.15-0.16 and one key predictor (the trustworthiness linear cosine term) is non-significant at p = .08; and (b) the stimulus/participant base is narrow (male-only stimuli in Exp 1/2/4, same-gender comparisons throughout, all-German participants, German-only encoder) relative to the discussion's generalization to voice assistants, advertising, and 'political propaganda.'
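To make issue (a) concrete, the following is a minimal simulation sketch, not a reanalysis of the paper's data. It assumes hypothetical values throughout (200 listeners, a true slope of 0.5, unit-variance rating noise, ten equal-width similarity categories) and uses pooled Pearson correlations rather than the per-listener Spearman rho the paper reports. It shows how regressing on ten category means can yield a very high R² even when the rating-level association is weak.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design sizes; only the 100 speakers and 10 categories echo the paper.
n_listeners, n_speakers, n_bins = 200, 100, 10

# Hypothetical data: a weak true effect of similarity buried in rating noise.
similarity = rng.uniform(0.0, 1.0, size=(n_listeners, n_speakers))
rating = 0.5 * similarity + rng.normal(0.0, 1.0, size=similarity.shape)

# Rating-level association across every individual judgment.
r_individual = np.corrcoef(similarity.ravel(), rating.ravel())[0, 1]

# Aggregated analysis: collapse ratings into 10 similarity categories
# and correlate the category means, discarding within-category variance.
bins = np.minimum((similarity * n_bins).astype(int), n_bins - 1)
bin_x = np.array([similarity[bins == b].mean() for b in range(n_bins)])
bin_y = np.array([rating[bins == b].mean() for b in range(n_bins)])
r_aggregated = np.corrcoef(bin_x, bin_y)[0, 1]

r2_individual = r_individual ** 2
r2_aggregated = r_aggregated ** 2
print(f"rating-level R^2:   {r2_individual:.3f}")
print(f"category-mean R^2:  {r2_aggregated:.3f}")
```

Under these assumed parameters the category-mean R² is far larger than the rating-level R², because averaging removes the noise that dominates individual judgments. The aggregated figure describes how well a line fits ten means, not how much of any listener's rating similarity explains.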
Central claims & evidence map
| Claim | Type | Evidence offered | Support | Overclaiming | Main weakness |
|---|---|---|---|---|---|
| The title and abstract assert that AI-determined similarity CAUSES higher trust/likability ('increases', 'increased'), but Experiment 5 — the study bearing on this claim — is a purely observational/correlational design: participants rate a fixed panel of 100 pre-sampled same-gend | we observed that voices similar to one’s own voice increased trustworthiness and likability, whereas average voices did not elicit such effects | Moderate | Moderate | The title and abstract assert that AI-determined similarity CAUSES higher trust/likability ('increases', 'increased'), but Experiment 5 — the study bearing on this claim — is a purely observational/correlational design: participants rate a fixed panel of 100 pre-sampled same-gend |
| The Experiment 5 conclusion that the models account for a 'substantial proportion of the variance' and have 'practical relevance' rests on aggregated regressions using only 10 category means as data points (R² = 0.97 for likeability, R² = 0.955 for trustworthiness). Collapsing th | the overall fit of the models underscores the practical relevance of voice similarity in shaping social perceptions | Moderate | Moderate | The Experiment 5 conclusion that the models account for a 'substantial proportion of the variance' and have 'practical relevance' rests on aggregated regressions using only 10 category means as data points (R² = 0.97 for likeability, R² = 0.955 for trustworthiness). Collapsing th |
| The stimulus and participant base is narrow in ways that constrain the headline claim more than the discussion acknowledges: Experiments 1, 2, and 4 use only male speakers, the self-comparison experiments (3 and 5) restrict comparison stimuli to same-gender speakers, all particip | We used only male speakers in this study to simplify the experimental design and ensure consistent conditions. | Moderate | Minor | The stimulus and participant base is narrow in ways that constrain the headline claim more than the discussion acknowledges: Experiments 1, 2, and 4 use only male speakers, the self-comparison experiments (3 and 5) restrict comparison stimuli to same-gender speakers, all particip |
Per-claim assessment
CLAIM-001. The title and abstract assert that AI-determined similarity CAUSES higher trust/likability ('increases', 'increased'), but Experiment 5 — the study bearing on this claim — is a purely observational/correlational design: participants rate a fixed panel of 100 pre-sampled same-gend
refutes
CLAIM-002. The Experiment 5 conclusion that the models account for a 'substantial proportion of the variance' and have 'practical relevance' rests on aggregated regressions using only 10 category means as data points (R² = 0.97 for likeability, R² = 0.955 for trustworthiness). Collapsing th
weakens
CLAIM-003. The stimulus and participant base is narrow in ways that constrain the headline claim more than the discussion acknowledges: Experiments 1, 2, and 4 use only male speakers, the self-comparison experiments (3 and 5) restrict comparison stimuli to same-gender speakers, all particip
weakens
Scorecard
Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.
Strongest critique
The load-bearing weakness is the causal framing of a non-causal design. The title ('AI-determined similarity increases likability and trustworthiness of human voices') and abstract ('voices similar to one's own voice increased trustworthiness and likability') assert causation, but Experiment 5 never manipulates similarity: it samples a fixed panel of 100 same-gender speakers spanning a similarity range and correlates each speaker's cosine-similarity-to-listener with likability/trust ratings ('we employed the similarity category as the predictor'). This is a between-speaker observational association, so speaker-level confounds — audio quality, articulation proficiency, and semantic content, all of which the authors concede vary across their open-source stimuli and 'reduced the internal validity of the experiments' — can drive the effect. The paper's own body text uses the correct correlational register ('voices similar to one's own are perceived as more likable and trustworthy'), showing the authors know the design cannot license 'increases/increased'. This flaw survives adversarial refutation because no disclosure in the paper supplies the missing manipulation or rules out the confounds, and the effect is small (median rho ~0.15-0.16, with the trustworthiness linear cosine term non-significant at p = .08), so it cannot bear the causal and societal-risk weight ('political propaganda') the title/abstract place on it.
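The confounding argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the paper's data: it supposes a hypothetical speaker-level property (`quality`, standing in for something like recording quality) that influences both the embedding-based similarity score and the rating, with no direct path from similarity to liking. The effect sizes (0.4) and the sample of 5000 simulated speakers are invented to keep the estimates stable.

```python
import numpy as np

rng = np.random.default_rng(1)
n_speakers = 5000  # hypothetical; far more than the paper's 100-speaker panel

# Hypothetical speaker-level confound, standardized.
quality = rng.normal(size=n_speakers)

# Both variables depend on the confound; similarity has NO direct effect on liking.
similarity = 0.4 * quality + rng.normal(size=n_speakers)
liking = 0.4 * quality + rng.normal(size=n_speakers)


def residualize(y, x):
    """Return y with its least-squares linear dependence on x removed."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw association, as a correlational design would observe it.
r_raw = np.corrcoef(similarity, liking)[0, 1]

# Partial association after removing the confound from both variables.
r_partial = np.corrcoef(
    residualize(similarity, quality), residualize(liking, quality)
)[0, 1]

print(f"raw correlation:     {r_raw:.3f}")
print(f"partial correlation: {r_partial:.3f}")
```

In this toy setup a small positive raw correlation appears even though similarity does nothing causally, and it vanishes once the confound is partialled out. Partialling is only possible when the confound is measured; with uncontrolled stimulus properties, a correlational regression cannot separate the two accounts.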
Strongest fair defence
The paper is methodologically conscientious in ways that deserve real credit and that blunt several possible criticisms. All five experiments were preregistered (OSF links given per experiment), anonymized data are openly deposited on figshare, and R analysis code is described as publicly available, a strong reproducibility posture. The authors do not merely assert their AI measure is valid: they establish it across two experiments, quantify human test-retest reliability (Mdn rho = 0.57) and inter-rater agreement (ICC(A,1) = 0.31), and apply an attenuation correction while explicitly flagging the corrected values as upper bounds, which is statistically responsible handling. They repeatedly and honestly characterize effects as 'modest'/'weak'/'small', report the non-significant trustworthiness cosine term (p = .08), and disclose key limitations (gender restriction, open-source stimulus variability, online-testing noise, and the diluting effect of spanning the full similarity range). On the beauty-in-averageness dimension the paper is genuinely careful and self-critical: it confronts that prior studies used statistical-averaging or morphing composites, raises the harmonics-to-noise-ratio artifact, flags a possibly 'insufficient number of particularly typical and atypical speakers' and semantic-content bias, and draws only a hedged conclusion ('there is little evidence for a beauty-in-average effect of voices'). A construct-validity attack on its typicality operationalization is therefore largely pre-empted, and this Comment declines to raise one rather than manufacture it. The correlational language of the body text also means the overclaim is arguably confined to the title and abstract rather than pervading the analysis, which is why the headline severity is moderate, not high.
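For readers unfamiliar with attenuation correction, the classic Spearman disattenuation formula is sketched below. Whether the paper uses exactly this form is an assumption, as is setting the reliability of the deterministic AI measure to r_xx = 1; r_yy = 0.57 is the human test-retest reliability quoted above, and rho_observed is left symbolic.

```latex
\rho_{\text{corrected}}
  = \frac{\rho_{\text{observed}}}{\sqrt{r_{xx}\, r_{yy}}},
\qquad
r_{xx} = 1,\; r_{yy} = 0.57
\;\Rightarrow\;
\rho_{\text{corrected}}
  = \frac{\rho_{\text{observed}}}{\sqrt{0.57}}
  \approx 1.32\,\rho_{\text{observed}}
```

Because the divisor is below 1, the correction can only raise the magnitude of the estimate, and any underestimate of reliability over-corrects. That is consistent with the authors' choice to present corrected values as upper bounds.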
Conclusion
This is a competent, transparent, preregistered study whose empirical contribution is credible but small: a lightweight d-vector cosine measure moderately tracks human voice-similarity judgments, including self-voice comparisons, and self-similar voices attract slightly higher likability/trust ratings. Its one serious flaw is overclaiming: the title and abstract render a correlational, confound-exposed Experiment 5 in causal language ('increases'/'increased'), a claim the design cannot support and that the authors' own internal-validity admission undercuts. Because the body text uses the correct correlational register and the limitations are largely disclosed, the overclaim is confined to the title and abstract of an otherwise sound paper, making this moderate rather than high severity. Two secondary, span-grounded points add coverage without carrying the headline: the aggregated category-mean R² inflates apparent explained variance relative to a person-level rho of ~0.15-0.16 with a non-significant key term, and the same-gender, German-only sampling under-supports the sweeping societal generalization.
Reply from the authors
Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.
Reply: not yet invited. No reply has been received for publication.
The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.
Source-grounding attestation
- ✓ Verbatim source spans present in the critique: 3/3 provenance spans re-derived in the critique prose
- ✓ Passes the publication validator: no errors
- ✓ Zero fabricated citations: 0 fabricated
- ✓ Severity within the access-basis cap: severity "moderate" ≤ cap "high" for open_access
Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).
Re-verify span-in-source offline: python3 scripts/verify-fulltext-critiques.py
Version & correction history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-07-01 | |
No silent substantive corrections — every change is versioned and visible.
How to cite this Comment
Critical AI. Comment on “AI-determined similarity increases likability and trustworthiness of human voices” (Oliver Jaggy et al., PLOS ONE, 2025). Critical AI; 2026. https://policywindow.org/critique/c/ai-voice-similarity-likability-trust
A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.
Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/ai-voice-similarity-likability-trust/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique ai-voice-similarity-likability-trust --live.
Content fingerprint f53fd993364a60ea (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.
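As an illustration of how a content-addressed fingerprint exposes silent edits, here is a minimal sketch. The actual scheme behind the fingerprint above is not documented on this page; the use of SHA-256, whitespace normalization, truncation to 16 hexadecimal characters, and the function name `fingerprint` are all assumptions made for the example.

```python
import hashlib


def fingerprint(text: str, length: int = 16) -> str:
    """Hypothetical fingerprint: truncated SHA-256 of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


original = "the headline severity is moderate, not high"
reflowed = "the headline   severity is\nmoderate, not high"  # layout change only
edited = "the headline severity is high, not moderate"       # substantive change

fp_original = fingerprint(original)
fp_reflowed = fingerprint(reflowed)
fp_edited = fingerprint(edited)

print(fp_original == fp_reflowed)  # layout-only change keeps the fingerprint
print(fp_original == fp_edited)    # substantive change alters it
```

Under this assumed scheme, reflowing the text leaves the fingerprint unchanged while any change to the wording produces a different value, which is the property the statement above relies on.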