{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000032","slug":"ai-voice-similarity-likability-trust","url":"https://policywindow.org/critique/c/ai-voice-similarity-likability-trust","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-07-01","current_version":"1.0","target_paper":{"title":"AI-determined similarity increases likability and trustworthiness of human voices","authors":["Oliver Jaggy","Stephan Schwan","Hauke S. Meyerhoff"],"journal":"PLOS ONE","doi":"10.1371/journal.pone.0318890","url":"https://doi.org/10.1371/journal.pone.0318890","publicationDate":"2025-03-05","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.1371/journal.pone.0318890"},"source_journal":{"tier":"exception","rankingSources":["resolved from the monitored-venue determination"],"rankingNote":"Off-monitored: PLOS ONE is a peer-reviewed, gold open-access (CC BY) megajournal not in the journal's monitored top-tier list; critiqued from its verbatim open-access full text."},"selection_provenance":{"id":"ai-voice-similarity-likability-trust","venue":"PLOS ONE","inMonitoredSet":false,"determinedTier":null,"recordedTier":"exception","effectiveTier":"exception","kind":"off_list","disclosed":true,"offListPeerReviewed":true},"selection":{"aiAgiCentralityScore":3,"societalRelevanceScore":3,"aiAgiCategories":[],"selectionReason":"Autonomous production cycle (psychology deepening); OA full-text critique via the G119-improved producer + 3-lens convergence gate."},"scores":{"aiAgiContribution":3,"evidentiarySupport":3,"methodologicalRisk":3,"overclaiming":3,"reproducibilityOrAuditability":4,"societalImpactRelevance":3,"severity":"moderate","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"The paper runs five preregistered online experiments showing that an AI speaker-verification system's cosine-similarity 
measure moderately tracks how similar people judge two voices to be (including comparisons to one's own voice), and that voices computationally similar to a listener's own voice receive slightly higher likability and trust ratings. The empirical work is careful in real ways (preregistration, open data, test-retest reliability, attenuation correction with upper-bound flagging, replication in Exp 2), and the reported effect is genuine but small. The central problem is the gap between what the design can show and what the title and abstract claim. The study bearing on the flagship result (Experiment 5) is purely correlational: participants rate a fixed, pre-sampled panel of 100 same-gender speakers, and ratings are regressed on each speaker's cosine similarity to the listener's own voice; similarity is never manipulated. Yet the title says similarity 'increases' likability/trust and the abstract says similar voices 'increased' trust and likability, causal verbs the design cannot license. The authors themselves concede that uncontrolled stimulus properties (audio quality, articulation, semantic content) 'reduced the internal validity of the experiments,' which is exactly the kind of confound that a correlational similarity-rating regression cannot exclude. 
Two secondary, span-grounded issues: (a) the 'substantial proportion of the variance / practical relevance' framing leans on aggregated category-mean R² (0.97, 0.955) that discards within-category variance, while the real person-level effects are median Spearman rho ~0.15-0.16 and one key predictor (the trustworthiness linear cosine term) is non-significant at p = .08; and (b) the stimulus/participant base is narrow (male-only stimuli in Exp 1/2/4, same-gender comparisons throughout, all-German participants, German-only encoder) relative to the discussion's generalization to voice assistants, advertising, and 'political propaganda.'","claims":[{"id":"CLAIM-001","text":"The title and abstract assert that AI-determined similarity CAUSES higher trust/likability ('increases', 'increased'), but Experiment 5 — the study bearing on this claim — is a purely observational/correlational design: participants rate a fixed panel of 100 pre-sampled same-gend","type":"empirical","evidenceOffered":"we observed that voices similar to one’s own voice increased trustworthiness and likability, whereas average voices did not elicit such effects","support":"moderate","overclaiming":"moderate","assessment":"refutes","mainWeakness":"The title and abstract assert that AI-determined similarity CAUSES higher trust/likability ('increases', 'increased'), but Experiment 5 — the study bearing on this claim — is a purely observational/correlational design: participants rate a fixed panel of 100 pre-sampled same-gend","confidence":"high"},{"id":"CLAIM-002","text":"The Experiment 5 conclusion that the models account for a 'substantial proportion of the variance' and have 'practical relevance' rests on aggregated regressions using only 10 category means as data points (R² = 0.97 for likeability, R² = 0.955 for trustworthiness). 
Collapsing th","type":"empirical","evidenceOffered":"the overall fit of the models underscores the practical relevance of voice similarity in shaping social perceptions","support":"moderate","overclaiming":"moderate","assessment":"weakens","mainWeakness":"The Experiment 5 conclusion that the models account for a 'substantial proportion of the variance' and have 'practical relevance' rests on aggregated regressions using only 10 category means as data points (R² = 0.97 for likeability, R² = 0.955 for trustworthiness). Collapsing th","confidence":"high"},{"id":"CLAIM-003","text":"The stimulus and participant base is narrow in ways that constrain the headline claim more than the discussion acknowledges: Experiments 1, 2, and 4 use only male speakers, the self-comparison experiments (3 and 5) restrict comparison stimuli to same-gender speakers, all particip","type":"empirical","evidenceOffered":"We used only male speakers in this study to simplify the experimental design and ensure consistent conditions.","support":"moderate","overclaiming":"minor","assessment":"weakens","mainWeakness":"The stimulus and participant base is narrow in ways that constrain the headline claim more than the discussion acknowledges: Experiments 1, 2, and 4 use only male speakers, the self-comparison experiments (3 and 5) restrict comparison stimuli to same-gender speakers, all particip","confidence":"high"}],"sections":[],"strongest_critique":"The load-bearing weakness is the causal framing of a non-causal design. The title ('AI-determined similarity increases likability and trustworthiness of human voices') and abstract ('voices similar to one's own voice increased trustworthiness and likability') assert causation, but Experiment 5 never manipulates similarity: it samples a fixed panel of 100 same-gender speakers spanning a similarity range and correlates each speaker's cosine-similarity-to-listener with likability/trust ratings ('we employed the similarity category as the predictor'). 
This is a between-speaker observational association, so speaker-level confounds — audio quality, articulation proficiency, and semantic content, all of which the authors concede vary across their open-source stimuli and 'reduced the internal validity of the experiments' — can drive the effect. The paper's own body text uses the correct correlational register ('voices similar to one's own are perceived as more likable and trustworthy'), showing the authors know the design cannot license 'increases/increased'. This flaw survives adversarial refutation because no disclosure in the paper supplies the missing manipulation or rules out the confounds, and the effect is small (median rho ~0.15-0.16, with the trustworthiness linear cosine term non-significant at p = .08), so it cannot bear the causal and societal-risk weight ('political propaganda') the title/abstract place on it.","strongest_fair_defence":"The paper is methodologically conscientious in ways that deserve real credit and that blunt several possible criticisms. All five experiments were preregistered (OSF links given per experiment), anonymized data are openly deposited on figshare, and R analysis code is described as publicly available — a strong reproducibility posture. The authors do not merely assert their AI measure is valid: they establish it across two experiments, quantify human test-retest reliability (Mdn rho = 0.57) and inter-rater agreement (ICC(A,1) = 0.31), and apply an attenuation correction while explicitly flagging the corrected values as upper bounds — statistically responsible handling. They repeatedly and honestly characterize effects as 'modest'/'weak'/'small', report the non-significant trustworthiness cosine term (p = .08), and disclose key limitations (gender restriction, open-source stimulus variability, online-testing noise, and the diluting effect of spanning the full similarity range). 
Notably, on the beauty-in-averageness dimension the paper is genuinely careful and self-critical: it confronts that prior studies used statistical-averaging or morphing composites, raises the harmonics-to-noise-ratio artifact, flags a possibly 'insufficient number of particularly typical and atypical speakers' and semantic-content bias, and draws only a hedged conclusion ('there is little evidence for a beauty-in-average effect of voices') — so a construct-validity attack on its typicality operationalization is largely pre-empted and is left uncovered here rather than manufactured. The genuinely correlational body language also means the over-claim is arguably confined to the title/abstract rather than pervading the analysis, which is why the headline severity is moderate, not high.","final_judgment":"This is a competent, transparent, preregistered study whose empirical contribution — that a lightweight d-vector cosine measure moderately tracks human voice-similarity judgments, including self-voice comparisons, and that self-similar voices attract slightly higher likability/trust ratings — is credible but small. Its one serious flaw is dimension-overclaiming: the title and abstract render a correlational, confound-exposed Experiment 5 in causal language ('increases'/'increased'), a claim the design cannot support and that the authors' own internal-validity admission undercuts. Because the body text uses the correct correlational register and the limitations are largely disclosed, the over-claim is bounded to the title/abstract on an otherwise-sound paper, making this moderate rather than high severity. 
Two secondary, span-grounded points — the aggregated-category-mean R² inflating apparent explained variance over a person-level rho of ~0.15-0.16 with a non-significant key term, and the same-gender/German-only sampling under-supporting the sweeping societal generalization — add coverage without carrying the headline.","review_process":{"aiAgentsUsed":["produce","sharpen","gate_refute","gate_defender","gate_neutral"],"reviewRounds":1,"humanEditor":{"name":"","role":"","approvalDate":"","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited"},"versions":[{"version":"1.0","date":"2026-07-01","note":"","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Critique produced by the autonomous production cycle (G119 producer → 3-lens convergence gate); staged pending the operator-gated promotion (--promote) which runs the full automated integrity gate. No human editor.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"PLOS ONE (gold open access, CC BY 4.0) quoted sparingly under criticism/review; critique targets claims, methods and inference only — never the authors."}}}