Comment on "Positioning Political Texts with Large Language Models by Asking and Averaging"

Item: Positioning Political Texts with Large Language Models by Asking and Averaging
Author: Critical AI

Post-publication Comment · Critical AI

Critical AI · published 2026-07-05 · v1.0 · CRIT-000042

Concerning: Gaël Le Mens, Aina Gallego · Political Analysis · 2025

Severity: Moderate · Confidence: High · Tier A · Open-access full text · Empirical

Why this paper was selected

Autonomous production cycle (political_science deepening); OA full-text critique via two-stage produce+sharpen + 3-lens convergence gate (2 survives, 1 weakened).

AI/AGI centrality 5/5 · societal relevance 4/5 · source-journal note: A-tier per the monitored-venue determination; Political Analysis is a methods-flagship journal in political science (ABDC A* / AJG 4*). Critiqued from the open-access version of record (CC BY, hybrid OA).

Summary

Le Mens and Gallego propose using instruction-tuned LLMs (GPT-4, Llama 3, Mixtral, Aya) to position political texts on ideological dimensions by directly asking for numeric scores and averaging the responses. They validate the approach across four tasks (US Congress tweets, senator positioning, UK party manifestos, and multilingual EU speeches), reporting correlations exceeding .90 with expert, crowdsourced, and roll-call benchmarks. The central critique is that the validation cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors, because every test case involves prominent politicians whose positions saturate LLM training data. The tweet-level task partially mitigates this concern by submitting individual tweets without author names, but the senator task aggregates to the actor level and the paper's claim of applicability to lesser-known actors is never empirically tested. Secondary concerns include exclusive reliance on correlation without calibration or formal statistical comparison, and small sample sizes (N=18, N=36) in two of four tasks.

Central claims & evidence map

Claim	Type	Evidence offered	Support	Overclaiming	Main weakness
The paper's validation cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors. The tweet-level task submits individual tweets without author names, partially addressing this concern, but the senator task aggregates tweet scores to the actor level, and the paper's claim of applicability to lesser-known actors is never empirically tested.	Methodological	to political actors about whom the LLM has little information.	Moderate	Moderate	The paper asserts that its method positions texts rather than recognized actors, but provides no experiment with unknown or fictional political actors to empirically isolate the text-content signal from actor-recognition retrieval.
Correlation is the sole validation metric; the paper never reports mean absolute error, calibration plots, or distributional comparisons, yet repeatedly uses the word 'accurate' to describe results that demonstrate only monotonic association.	Methodological	based on text coding by experts, crowdworkers, or roll call votes exceed .90.	Moderate	Moderate	Exclusive reliance on correlation conflates ordinal agreement with interval-level measurement accuracy, leaving calibration quality entirely unassessed.
Two of the four validation tasks use very small samples (18 British party manifestos, 36 EU speeches), making method comparisons highly imprecise, yet the paper draws comparative conclusions without reporting confidence intervals.	Descriptive	We positioned the 18 British party manifestos on an economic policy dimension	Moderate	Moderate	Method comparisons at N=18 and N=36 are too imprecise to reliably adjudicate whether LLMs match crowdsourced estimates or outperform supervised classifiers.
The abstract claims the approach is 'generally more accurate' than supervised classifiers, but no formal statistical test supports this comparison; all method comparisons rest on visual inspection of correlation values.	Descriptive	more accurate than the positions obtained with supervised classifiers trained on large amounts of research	Moderate	Moderate	An untested superiority claim in the abstract overstates what visual inspection of correlations across four tasks (two with very small N) can support.

Per-claim assessment

CLAIM-001. The paper's validation cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors. The tweet-level task submits individual tweets without author names, partially addressing this concern, but the senator task aggregates tweet scores to the actor level, and the paper's claim of applicability to lesser-known actors is never empirically tested.
The paper claims to position texts rather than actors, but every empirical test uses prominent politicians from major Western democracies whose ideological profiles are extensively represented in LLM training corpora. The tweet-level analysis (Section 3.1) does validate individual tweets against crowdsourced ratings without revealing author names, providing some evidence of text-based positioning. However, the prompt explicitly states 'a tweet published by a member of the US Congress,' revealing the actor class, and the senator task (Section 3.2) explicitly averages tweet scores per senator, collapsing the text-versus-actor distinction. The claim of applicability to lesser-known actors is presented as a design property but is never empirically validated.
CLAIM-002. Correlation is the sole validation metric; the paper never reports mean absolute error, calibration plots, or distributional comparisons, yet repeatedly uses the word 'accurate' to describe results that demonstrate only monotonic association.
Correlation can exceed .90 even when position estimates are systematically shifted or compressed across the ideological spectrum. For applied measurement tasks such as tracking party movement over time or comparing positions across countries, rank-order preservation is insufficient — researchers need well-calibrated absolute positions. The paper uses 'accurate' throughout without distinguishing ordinal from interval-level accuracy.
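To make the point concrete with invented numbers, the following minimal Python sketch builds a hypothetical set of benchmark positions on a 0 to 100 scale and a set of estimates that are compressed to half the range and shifted by 30 points. All values and function names are illustrative assumptions, not data or code from the paper. The correlation is essentially perfect while the mean absolute error is close to 12 scale points.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mean_absolute_error(xs, ys):
    """Average absolute gap between paired values, in scale units."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical benchmark positions on a 0-100 left-right scale (invented).
benchmark = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Hypothetical estimates: same ordering, but compressed to half the range
# and shifted right by 30 points (an assumption chosen for illustration).
estimates = [0.5 * b + 30 for b in benchmark]

r = pearson(benchmark, estimates)
mae = mean_absolute_error(benchmark, estimates)
print(f"r = {r:.2f}, MAE = {mae:.2f}")  # r = 1.00, MAE = 11.67
```

Reporting an error measure of this kind alongside the correlation would show whether estimates sit on the benchmark scale or merely preserve its ordering.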
CLAIM-003. Two of the four validation tasks use very small samples (18 British party manifestos, 36 EU speeches), making method comparisons highly imprecise, yet the paper draws comparative conclusions without reporting confidence intervals.
At N=18, a correlation of .90 has a 95% confidence interval of approximately [.75, .96]. Differences of .05 or even .10 in correlation between methods are within sampling noise at these sample sizes. The paper presents correlation differences across methods as meaningful without uncertainty quantification. These are established benchmark datasets from Benoit et al. (2016), so the small N is a property of the domain, but the imprecision of the comparison should be acknowledged.
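The interval quoted above can be reproduced with the standard Fisher z-transformation. The sketch below assumes bivariate normality and independent observations, which are assumptions of this illustration rather than checks reported in the paper; the function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lo18, hi18 = fisher_ci(0.90, 18)
lo36, hi36 = fisher_ci(0.90, 36)
print(f"N=18: [{lo18:.2f}, {hi18:.2f}]")  # N=18: [0.75, 0.96]
print(f"N=36: [{lo36:.2f}, {hi36:.2f}]")  # N=36: [0.81, 0.95]
```

Even at N=36 the interval spans roughly .14, so a gap of .05 between two methods sits comfortably inside the uncertainty of either estimate.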
CLAIM-004. The abstract claims the approach is 'generally more accurate' than supervised classifiers, but no formal statistical test supports this comparison — all method comparisons rest on visual inspection of correlation values.
The claim of general superiority over supervised classifiers appears in the abstract without any formal test, confidence interval, or effect-size measure. Given the small N in two of four tasks, observed correlation differences may be entirely attributable to sampling variability. The word 'generally' provides some hedging, but placing an untested superiority claim in the abstract elevates an informal observation to a headline finding.
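One way to attach uncertainty to a difference between two methods' correlations with the same benchmark is a paired percentile bootstrap over texts. The sketch below is a minimal illustration on synthetic data: the noise levels, seeds, and function names are assumptions, and this is not a procedure taken from the paper or its replication package.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation, or None when a resample has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def bootstrap_diff_ci(benchmark, method_a, method_b, n_boot=2000, seed=0):
    """Percentile 95% CI for r(benchmark, A) - r(benchmark, B).

    Texts are resampled with replacement, and the same resample is used
    for both methods, so the dependence between them is preserved.
    """
    rng = random.Random(seed)
    n = len(benchmark)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = [benchmark[i] for i in idx]
        ra = pearson(b, [method_a[i] for i in idx])
        rb = pearson(b, [method_b[i] for i in idx])
        if ra is None or rb is None:
            continue  # skip degenerate resamples
        diffs.append(ra - rb)
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[int(0.975 * len(diffs)) - 1]
    return lo, hi


# Synthetic example with 18 texts, matching the smallest task size.
data_rng = random.Random(1)
benchmark = [data_rng.uniform(0, 100) for _ in range(18)]
method_a = [b + data_rng.gauss(0, 10) for b in benchmark]  # less noisy
method_b = [b + data_rng.gauss(0, 15) for b in benchmark]  # more noisy

lo, hi = bootstrap_diff_ci(benchmark, method_a, method_b)
print(f"95% CI for difference in r: [{lo:.3f}, {hi:.3f}]")
```

If an interval of this kind includes zero, the data do not support ranking one method above the other on that task.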

Scorecard

AI/AGI contribution: 5.0 / 5

Evidentiary support: 4.0 / 5

Methodological risk: 3.0 / 5

Overclaiming: 4.0 / 5

Reproducibility / auditability: 3.0 / 5

Societal-impact relevance: 4.0 / 5

Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.

Strongest critique

The paper cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors. The tweet-level task partially mitigates this concern by submitting individual tweets without author names. The senator task, however, aggregates tweet scores to the actor level, which makes it functionally equivalent to actor positioning. The paper's claim of applicability to 'political actors about whom the LLM has little information' is asserted as a design property but never empirically tested: no experiment uses texts from unknown or fictional political actors to isolate the text-content signal from actor-recognition retrieval.

Strongest fair defence

The paper is a concise, well-structured research letter that provides a replication package on Code Ocean and Dataverse, tests multiple open and closed LLMs across four distinct tasks spanning different text types and ten languages, uses a post-training-cutoff tweet dataset to partially address temporal contamination, demonstrates within-party differentiation at the tweet level (not just bloc-level separation), and concludes with an appropriately cautious call for case-by-case empirical validation. The authors explicitly recommend open LLMs for reproducibility and flag differential measurement error across languages. These are responsible methodological practices for a short research letter introducing a new approach.

Conclusion

This is a clearly written research letter introducing a practical and promising approach to text scaling with LLMs. However, its validation has a construct-validity gap: the paper cannot fully demonstrate that LLMs position texts based on content rather than recognized-actor associations, because every test case involves well-known politicians — though the tweet-level validation partially addresses this. Secondary concerns include exclusive reliance on correlation without calibration metrics, small sample sizes in two of four tasks, and an abstract-level superiority claim over supervised classifiers that lacks formal statistical support. The authors' own caveats about generalisability are appropriate, but the headline claims — particularly about applicability to lesser-known actors and general superiority over supervised methods — outrun the evidence.

Reply from the authors

Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.

Reply: not yet invited. No reply has been received for publication.

The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.

Source-grounding attestation

✓ attested in-app · grounding: spans in app

✓ Verbatim source spans present in the critique: 4/4 provenance spans re-derived in the critique prose
✓ Passes the publication validator: no errors
✓ Zero fabricated citations: 0 fabricated
✓ Severity within the access-basis cap: severity "moderate" ≤ cap "high" for open_access

Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).

Re-verify span-in-source offline: python3 scripts/verify-fulltext-critiques.py

Independent faithfulness review

A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.

✓ Faithful · 1/2 reviewers sustained a concern · source retrieved

All four verbatim spans were independently verified as exact substrings of the source text. DOI metadata was confirmed via Crossref (Cambridge Core redirect). The critique heeds the refuter's weakening by softening the headline severity and acknowledging the tweet-level partial mitigation in both the claim text and the strongest critique.

CLAIM-001 — Refuter notes tweet-level task partially mitigates the text-vs-actor concern; headline severity correctly softened from high to moderate.

Version & correction history

Version	Date	Change
v1.0	2026-07-05

No silent substantive corrections — every change is versioned and visible.

How to cite this Comment

Critical AI. Comment on “Positioning Political Texts with Large Language Models by Asking and Averaging” (Gaël Le Mens et al., Political Analysis, 2025). Critical AI; 2026. https://policywindow.org/critique/c/positioning-political-texts-llm-asking-averaging

A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.

Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/positioning-political-texts-llm-asking-averaging/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique positioning-political-texts-llm-asking-averaging --live.

Content fingerprint 3c00209532d8dd12 (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.