{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000042","slug":"positioning-political-texts-llm-asking-averaging","url":"https://policywindow.org/critique/c/positioning-political-texts-llm-asking-averaging","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-07-05","current_version":"1.0","target_paper":{"title":"Positioning Political Texts with Large Language Models by Asking and Averaging","authors":["Gaël Le Mens","Aina Gallego"],"journal":"Political Analysis","doi":"10.1017/pan.2024.29","url":"https://doi.org/10.1017/pan.2024.29","publicationDate":"2025","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.1017/pan.2024.29"},"source_journal":{"tier":"A","rankingSources":["resolved from the monitored-venue determination"],"rankingNote":"A-tier per the monitored-venue determination; Political Analysis is a methods-flagship journal in political science (ABDC A* / AJG 4*). 
Critiqued from the open-access version of record (CC BY, hybrid OA)."},"selection_provenance":{"id":"positioning-political-texts-llm-asking-averaging","venue":"Political Analysis","inMonitoredSet":true,"determinedTier":"A","recordedTier":"A","effectiveTier":"A","kind":"monitored","disclosed":true,"offListPeerReviewed":false},"selection":{"aiAgiCentralityScore":5,"societalRelevanceScore":4,"aiAgiCategories":["ai_methods"],"selectionReason":"Autonomous production cycle (political_science deepening); OA full-text critique via two-stage produce+sharpen + 3-lens convergence gate (2 survives, 1 weakened).","domain":"political_science"},"scores":{"aiAgiContribution":5,"evidentiarySupport":4,"methodologicalRisk":3,"overclaiming":4,"reproducibilityOrAuditability":3,"societalImpactRelevance":4,"severity":"moderate","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"Le Mens and Gallego propose using instruction-tuned LLMs (GPT-4, Llama 3, Mixtral, Aya) to position political texts on ideological dimensions by directly asking for numeric scores and averaging responses. They validate across four tasks — US Congress tweets, senator positioning, UK party manifestos, and multilingual EU speeches — reporting correlations exceeding .90 with expert, crowdsourced, and roll-call benchmarks. The central critique is that the validation cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors, because every test case involves prominent politicians whose positions saturate LLM training data. The tweet-level task partially mitigates this concern by submitting individual tweets without author names, but the senator task aggregates to the actor level and the paper's claim of applicability to lesser-known actors is never empirically tested. 
Secondary concerns include exclusive reliance on correlation without calibration or formal statistical comparison, and small sample sizes (N=18, N=36) in two of four tasks.","claims":[{"id":"CLAIM-001","text":"The paper's validation cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors. The tweet-level task submits individual tweets without author names, partially addressing this concern, but the senator task aggregates tweet scores to the actor level, and the paper's claim of applicability to lesser-known actors is never empirically tested.","type":"methodological","evidenceOffered":"to political actors about whom the LLM has little information.","support":"moderate","overclaiming":"moderate","assessment":"The paper claims to position texts rather than actors, but every empirical test uses prominent politicians from major Western democracies whose ideological profiles are extensively represented in LLM training corpora. The tweet-level analysis (Section 3.1) does validate individual tweets against crowdsourced ratings without revealing author names, providing some evidence of text-based positioning. However, the prompt explicitly states 'a tweet published by a member of the US Congress,' revealing the actor class, and the senator task (Section 3.2) explicitly averages tweet scores per senator, collapsing the text-versus-actor distinction. 
The claim of applicability to lesser-known actors is presented as a design property but is never empirically validated.","mainWeakness":"The paper asserts that its method positions texts rather than recognized actors, but provides no experiment with unknown or fictional political actors to empirically isolate the text-content signal from actor-recognition retrieval.","confidence":"high"},{"id":"CLAIM-002","text":"Correlation is the sole validation metric; the paper never reports mean absolute error, calibration plots, or distributional comparisons, yet repeatedly uses the word 'accurate' to describe results that demonstrate only monotonic association.","type":"methodological","evidenceOffered":"based on text coding by experts, crowdworkers, or roll call votes exceed .90.","support":"moderate","overclaiming":"moderate","assessment":"Correlation can exceed .90 even when position estimates are systematically shifted or compressed across the ideological spectrum. For applied measurement tasks such as tracking party movement over time or comparing positions across countries, rank-order preservation is insufficient — researchers need well-calibrated absolute positions. The paper uses 'accurate' throughout without distinguishing ordinal from interval-level accuracy.","mainWeakness":"Exclusive reliance on correlation conflates ordinal agreement with interval-level measurement accuracy, leaving calibration quality entirely unassessed.","confidence":"high"},{"id":"CLAIM-003","text":"Two of the four validation tasks use very small samples (18 British party manifestos, 36 EU speeches), making method comparisons highly imprecise, yet the paper draws comparative conclusions without reporting confidence intervals.","type":"descriptive","evidenceOffered":"We positioned the 18 British party manifestos on an economic policy dimension","support":"moderate","overclaiming":"moderate","assessment":"At N=18, a correlation of .90 has a 95% confidence interval of approximately [.75, .96]. 
Differences of .05 or even .10 in correlation between methods are within sampling noise at these sample sizes. The paper presents correlation differences across methods as meaningful without uncertainty quantification. These are established benchmark datasets from Benoit et al. (2016), so the small N is a property of the domain, but the imprecision of the comparison should be acknowledged.","mainWeakness":"Method comparisons at N=18 and N=36 are too imprecise to reliably adjudicate whether LLMs match crowdsourced estimates or outperform supervised classifiers.","confidence":"moderate"},{"id":"CLAIM-004","text":"The abstract claims the approach is 'generally more accurate' than supervised classifiers, but no formal statistical test supports this comparison — all method comparisons rest on visual inspection of correlation values.","type":"descriptive","evidenceOffered":"more accurate than the positions obtained with supervised classifiers trained on large amounts of research","support":"moderate","overclaiming":"moderate","assessment":"The claim of general superiority over supervised classifiers appears in the abstract without any formal test, confidence interval, or effect-size measure. Given the small N in two of four tasks, observed correlation differences may be entirely attributable to sampling variability. The word 'generally' provides some hedging, but placing an untested superiority claim in the abstract elevates an informal observation to a headline finding.","mainWeakness":"An untested superiority claim in the abstract overstates what visual inspection of correlations across four tasks (two with very small N) can support.","confidence":"moderate"}],"sections":[],"strongest_critique":"The paper cannot fully distinguish whether LLMs recover ideological positions from textual content or from memorized associations with well-known political actors. 
The tweet-level task partially mitigates this concern by submitting individual tweets without author names, but the senator task aggregates tweet scores to the actor level, making it functionally equivalent to actor positioning, and the paper's claim of applicability to 'political actors about whom the LLM has little information' is asserted as a design property but never empirically tested — no experiment uses texts from unknown or fictional political actors to isolate the text-content signal from actor-recognition retrieval.","strongest_fair_defence":"The paper is a concise, well-structured research letter that provides a replication package on Code Ocean and Dataverse, tests multiple open and closed LLMs across four distinct tasks spanning different text types and ten languages, uses a post-training-cutoff tweet dataset to partially address temporal contamination, demonstrates within-party differentiation at the tweet level (not just bloc-level separation), and concludes with an appropriately cautious call for case-by-case empirical validation. The authors explicitly recommend open LLMs for reproducibility and flag differential measurement error across languages. These are responsible methodological practices for a short research letter introducing a new approach.","final_judgment":"This is a clearly written research letter introducing a practical and promising approach to text scaling with LLMs. However, its validation has a construct-validity gap: the paper cannot fully demonstrate that LLMs position texts based on content rather than recognized-actor associations, because every test case involves well-known politicians — though the tweet-level validation partially addresses this. Secondary concerns include exclusive reliance on correlation without calibration metrics, small sample sizes in two of four tasks, and an abstract-level superiority claim over supervised classifiers that lacks formal statistical support. 
The authors' own caveats about generalisability are appropriate, but the headline claims — particularly about applicability to lesser-known actors and general superiority over supervised methods — outrun the evidence.","review_process":{"aiAgentsUsed":["AGISS critique engine (autonomous production cycle)"],"reviewRounds":1,"humanEditor":{"name":"","role":"","approvalDate":"","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited"},"versions":[{"version":"1.0","date":"2026-07-05","note":"","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Critique produced by the autonomous production cycle (two-stage produce+sharpen + 3-lens convergence gate, 2 survives / 1 weakened) and auto-published under the operator's auto-publish + post-audit model; the Mon/Thu audit is the post-hoc gate.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Political Analysis (Cambridge University Press, hybrid OA, CC BY) quoted sparingly under criticism/review; critique targets claims, methods and inference only."}}}