{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000020","slug":"inconsistent-advice-chatgpt-decision-making","url":"https://policywindow.org/critique/c/inconsistent-advice-chatgpt-decision-making","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-27","current_version":"1.0","target_paper":{"title":"Inconsistent advice by ChatGPT influences decision making in various areas","authors":["Shinnosuke Ikeda"],"journal":"Scientific Reports","doi":"10.1038/s41598-024-66821-4","url":"https://doi.org/10.1038/s41598-024-66821-4","publicationDate":"2024","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.1038/s41598-024-66821-4"},"source_journal":{"tier":"exception","rankingSources":["off-monitored: peer-reviewed journal (Nature Portfolio) not in the monitored determination; disclosed off-list"],"rankingNote":"Off-monitored: Scientific Reports is a peer-reviewed, gold open-access journal (CC BY 4.0) not in the journal's monitored top-tier determination; disclosed off-list. 
Critiqued at full text (PMC11233716)."},"selection_provenance":{"id":"inconsistent-advice-chatgpt-decision-making","venue":"Scientific Reports","inMonitoredSet":false,"determinedTier":null,"recordedTier":"exception","effectiveTier":"exception","kind":"off_list","disclosed":true,"offListPeerReviewed":true},"selection":{"aiAgiCentralityScore":3,"societalRelevanceScore":4,"aiAgiCategories":["human_AI_interaction"],"selectionReason":"Autonomous run toward benchmark quality (G90): the journal's first self-produced OPEN-ACCESS FULL-TEXT critique, deliberately engaging the coverage-benchmark blind-spot dimensions (identification, statistical inference) an abstract critique cannot reach; every span verbatim-grounded to the gold-OA full text (PMC11233716, CC BY 4.0)."},"scores":{"aiAgiContribution":3,"evidentiarySupport":2,"methodologicalRisk":4,"overclaiming":3,"reproducibilityOrAuditability":3,"societalImpactRelevance":4,"severity":"high","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"This paper claims ChatGPT advice changes people's decisions, but its main causal comparison pits one separately-collected no-advice sample (Study 1, n=111) against a different, larger sample (Study 2) — and the author concedes the two were surveyed under different conditions, which confounds the \"advice effect\" with study-context differences. Worse, Study 2 collected a clean within-subject \"what would you have chosen without the advice\" measure that could have identified the effect directly, yet the headline task analyses use the messier cross-study comparison. On top of this, two residual-analysis sentences claim statistically significant bias while citing \"(ps > 0.05)\" — the threshold for NON-significance — so the numbers as printed contradict the conclusions; and one of the two key regression models does not fit better than chance (p = 0.471) yet a coefficient inside it is still interpreted. These are the genuinely serious, defensible problems. 
The paper does deserve credit for openly posting data/materials, stating ethics approval, candidly reporting its null results, and flagging the context confound and cultural narrowness itself — so several dimensions the draft fired on are either author-disclosed or not cleanly groundable and should be dropped.","claims":[{"id":"C1","text":"ChatGPT advice causally influences people's moral decision-making across various areas (the paper's headline claim).","type":"causal","evidenceOffered":"The headline task analyses estimate the advice effect by comparing the responses without advice in Study 1 with those in Study 2. It is important to note that in Study 1, unlike in Study 2, the relevant questionnaire was administered concurrently with several other questionnaires.","support":"weak","overclaiming":"moderate","assessment":"The headline causal claim that advice changes decisions rests on a between-study, between-subjects comparison: the 'no advice' baseline is the separate Study 1 sample and the 'advice' conditions are the Study 2 sample. The author concedes verbatim that the data-collection context differed between the two studies, so the 2x5 chi-square cannot separate an advice effect from study/context/cohort differences. Critically, Study 2 also collected a within-subject counterfactual ('choice had the advice not been given') that would have identified the effect cleanly, yet the headline task analyses use the confounded cross-study contrast instead. 
This is the load-bearing inferential weakness behind the causal title.","mainWeakness":"The causal effect is estimated from a between-study comparison the author concedes is confounded by data-collection context, while a clean within-subject counterfactual is sidelined.","confidence":"high"},{"id":"C2","text":"A residual analysis localizes which conditions show statistically significant advice-driven bias.","type":"methodological","evidenceOffered":"a residual analysis was conducted, revealing significant bias in all three conditions (ps > 0.05), except for the no advice condition and the advising condition in which the AI recommended immediate rewards","support":"weak","overclaiming":"minor","assessment":"Two residual-analysis sentences assert statistically significant bias while citing '(ps > 0.05)' in support — p > 0.05 is the threshold for NON-significance, so as printed the inferential statements contradict their own supporting statistics. This is not a one-off: the same '(ps > 0.05)'-supporting-significance pattern appears for the trolley (switch) and the delayed value discounting tasks, while the gender task correctly reports '(ps < 0.01)', confirming the inconsistency is real and material rather than a single innocuous typo. 
Because the residual analysis is what localizes WHICH conditions drove each significant omnibus test, the as-printed direction undermines the per-condition conclusions.","mainWeakness":"As printed, the residual-analysis sentences cite '(ps > 0.05)' — the non-significance threshold — to support claims of significant bias, so the statistics contradict the conclusions.","confidence":"high"},{"id":"C3","text":"Personal Fear of Invalidity moderates susceptibility to expert advice, and trust in AI is unrelated.","type":"methodological","evidenceOffered":"The expert-advice model is interpreted despite McFadden R2 = 0.007, χ2 = 4.570, p = 0.471 (a fit no better than chance).","support":"weak","overclaiming":"minor","assessment":"Secondary statistical-inference problem: in the expert-advice logistic model the overall fit is non-significant (McFadden R2 = 0.007, chi2 = 4.570, p = 0.471), yet Personal Fear of Invalidity within that model (p = 0.044) is still interpreted as a real effect, and the abstract's 'not related to trust in AI' accepts a null from this underpowered, non-fitting model. 
The AI-advice model is significant overall (p = 0.015) but with a negligible McFadden R2 = 0.030, so the moderation effect sizes are tiny relative to the substantive framing.","mainWeakness":"A coefficient is interpreted inside a model that does not fit better than null (p = 0.471), and both moderation models carry negligible effect sizes.","confidence":"high"},{"id":"C4","text":"Inconsistent ChatGPT advice influences decision making 'in various areas' (the title/abstract breadth claim).","type":"causal","evidenceOffered":"The abstract itself hedges the breadth: although not all decisions were susceptible to influence, particularly those based on negative emotions","support":"weak","overclaiming":"moderate","assessment":"The title generalizes to 'various areas' and the abstract to broad causal 'influence' from four scripted vignette tasks, with NO advice effect in two cases (the bridge dilemma and the AI-immediate-reward discounting condition). The breadth claim is built partly on the confounded cross-study comparison and on negligible moderation effect sizes. This over-reach is partially mitigated because the abstract itself discloses 'although not all decisions were susceptible to influence,' so the discussion is more hedged than the title; the overclaim lives chiefly in the title/abstract framing.","mainWeakness":"The 'various areas' breadth rests on four scripted vignette tasks (with two null results), the confounded cross-study comparison, and negligible moderation effect sizes.","confidence":"high"}],"sections":[{"id":"what","title":"What the paper does","body":"A two-study (total n = 1925) online vignette experiment with a Japanese opt-in panel, testing whether ChatGPT advice shifts people's choices on moral/decision tasks, framed as an extension of an established advice-taking paradigm. 
It reports that ChatGPT advice influenced decisions similarly to expert advice in some tasks, with two null results candidly reported."},{"id":"identification","title":"Identification: the causal headline rests on a conceded confound","body":"The headline causal claim that advice changes decisions rests on a between-study, between-subjects comparison: the 'no advice' baseline is the separate Study 1 sample and the 'advice' conditions are the Study 2 sample. The author concedes verbatim that the data-collection context differed between the two studies, so the 2x5 chi-square cannot separate an advice effect from study/context/cohort differences. Critically, Study 2 also collected a within-subject counterfactual ('choice had the advice not been given') that would have identified the effect cleanly, yet the headline task analyses use the confounded cross-study contrast instead. This is the load-bearing inferential weakness behind the causal title."},{"id":"statistical-inference","title":"Statistical inference: significance reported against its own threshold","body":"Two residual-analysis sentences assert statistically significant bias while citing '(ps > 0.05)' in support — p > 0.05 is the threshold for NON-significance, so as printed the inferential statements contradict their own supporting statistics. This is not a one-off: the same '(ps > 0.05)'-supporting-significance pattern appears for the trolley (switch) and the delayed value discounting tasks, while the gender task correctly reports '(ps < 0.01)', confirming the inconsistency is real and material rather than a single innocuous typo. Because the residual analysis is what localizes WHICH conditions drove each significant omnibus test, the as-printed direction undermines the per-condition conclusions. 
Secondary statistical-inference problem: in the expert-advice logistic model the overall fit is non-significant (McFadden R2 = 0.007, chi2 = 4.570, p = 0.471), yet Personal Fear of Invalidity within that model (p = 0.044) is still interpreted as a real effect, and the abstract's 'not related to trust in AI' accepts a null from this underpowered, non-fitting model. The AI-advice model is significant overall (p = 0.015) but with a negligible McFadden R2 = 0.030, so the moderation effect sizes are tiny relative to the substantive framing."},{"id":"scope","title":"Scope and the 'various areas' breadth","body":"The title generalizes to 'various areas' and the abstract to broad causal 'influence' from four scripted vignette tasks, with NO advice effect in two cases (the bridge dilemma and the AI-immediate-reward discounting condition). The breadth claim is built partly on the confounded cross-study comparison and on negligible moderation effect sizes. This over-reach is partially mitigated because the abstract itself discloses 'although not all decisions were susceptible to influence,' so the discussion is more hedged than the title; the overclaim lives chiefly in the title/abstract framing."},{"id":"strengths","title":"What the paper does well","body":"This is a transparent, honestly-caveated exploratory replication-extension, and several of the draft's eight charges are answered by the paper itself. It is a declared extension of an established paradigm (Krügel et al.), and matching that study's per-condition n is a defensible, if not power-justified, basis for sample size. Ethics approval (Kanazawa University 2023-65) and informed consent are stated, and data plus materials are openly posted to OSF, which is above the field norm and substantially blunts the reproducibility charge. 
The author uses hedged verbs ('influenced', 'impacted', 'affected') rather than mechanistic causal language, candidly reports the two null results (bridge dilemma; AI-immediate-reward discounting) instead of suppressing them, and explicitly flags both the cross-study context confound and the cultural narrowness as limitations. The randomized 16-condition assignment and the persuasiveness-equated stimulus set show genuine design care. A fair reading is that the disclosed-limitations, measurement, generalizability, and reproducibility dimensions the draft fired on are largely either credited or author-acknowledged — the real, kept problems are concentrated in identification, statistical inference, and the title/abstract overclaim, not spread across all eight dimensions."}],"strongest_critique":"The paper's identification strategy cannot support its causal headline. The advice effect for the headline task analyses is estimated by comparing a separately-collected no-advice sample (Study 1) against the Study 2 advice conditions, and the author explicitly concedes the data-collection context differed: 'in Study 1, unlike in Study 2, the relevant questionnaire was administered concurrently with several other questionnaires.' That single conceded difference confounds the treatment with the study, so the 2x5 chi-square cannot isolate an advice effect from context/cohort differences — and this is avoidable, because Study 2 had already collected a within-subject counterfactual ('choice had the advice not been given') that would have identified the effect cleanly. Layered on top is a statistical-inference failure that is verifiable on the page: two residual-analysis sentences justify claims of SIGNIFICANT bias with the parenthetical '(ps > 0.05)' — the non-significance threshold — while a parallel sentence for the gender task correctly reads '(ps < 0.01)', confirming the contradiction is real and pervasive rather than a lone typo. 
Compounding this, the expert-advice model does not fit better than chance (McFadden R2 = 0.007, chi2 = 4.570, p = 0.471) yet a coefficient inside it is interpreted, and both moderation models carry negligible effect sizes (R2 = 0.030 / 0.007). The result is a broad causal title resting on a conceded confound, near-zero effect sizes, and internally contradictory significance reporting.","strongest_fair_defence":"This is a transparent, honestly-caveated exploratory replication-extension, and several of the draft's eight charges are answered by the paper itself. It is a declared extension of an established paradigm (Krügel et al.), and matching that study's per-condition n is a defensible, if not power-justified, basis for sample size. Ethics approval (Kanazawa University 2023-65) and informed consent are stated, and data plus materials are openly posted to OSF — above the field norm, which substantially blunts the reproducibility charge. The author uses hedged verbs ('influenced', 'impacted', 'affected') rather than mechanistic causal language, candidly reports the two null results (bridge dilemma; AI-immediate-reward discounting) instead of suppressing them, and explicitly flags both the cross-study context confound and the cultural narrowness as limitations. The randomized 16-condition assignment and the persuasiveness-equated stimulus set show genuine design care. 
A fair reading is that the disclosed-limitations, measurement, generalizability, and reproducibility dimensions the draft fired on are largely either credited or author-acknowledged — the real, kept problems are concentrated in identification, statistical inference, and the title/abstract overclaim, not spread across all eight dimensions.","final_judgment":"Empirical and on-topic, with real strengths in transparency (OSF data and materials, stated ethics approval and consent) and candor about its null results — but the inferential backbone is weak in three precisely-grounded ways that survive refute-by-default scrutiny. (1) The headline causal claim depends on a between-study comparison the author admits is confounded by data-collection context, while the within-subject counterfactual that could have identified the effect cleanly is sidelined. (2) Two residual-analysis significance statements are printed with '(ps > 0.05)' — i.e., as written they contradict their own conclusions, and the inconsistency is confirmed by a parallel '(ps < 0.01)' sentence elsewhere, so this is material, not cosmetic. (3) One of the two key models does not fit better than null (p = 0.471) yet a coefficient inside it is interpreted, and both moderation effect sizes are negligible (McFadden R2 0.030 / 0.007). The remaining five draft dimensions over-fired: disclosed-limitations, sample_data, and reproducibility largely re-charge the same underlying defects already counted under identification and statistical inference (double-counting), while measurement and generalizability concerns are either author-disclosed or not cleanly span-groundable. 
The work is best read as exploratory and suggestive; correcting the residual-analysis reporting, identifying the advice effect from the within-subject counterfactual, and downgrading the causal/'various areas' language in the title and abstract would be needed before the central claims are credible.","review_process":{"aiAgentsUsed":["claim_extraction","ai_agi_relevance","methods","statistics","reproducibility","overclaiming","adversarial","author_defence","citation_integrity","plain_language","meta_review"],"reviewRounds":2,"humanEditor":{"name":"","role":"","approvalDate":"2026-06-27","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited","editorialActionAfterResponse":"Authors may reply at any time; replies are published alongside this critique, and a reply flagging a factual error triggers automated re-evaluation and a versioned correction; this critique addresses claims, methods and inference only, never the author."},"versions":[{"version":"1.0","date":"2026-06-27","note":"Initial publication. First autonomously-produced open-access full-text critique (G90).","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Full-text critique: the gold-OA paper (CC BY 4.0) was read in full (verbatim text from PMC11233716), and every span the critique relies on was checked to be an exact substring of that text, independently re-verified against a fresh fetch. The target DOI resolves via Crossref (title + author + year identity-matched). Severity is set under the open-access access basis (cap lifted from the abstract-only 'moderate'). 
Characterization follows the journal's faithfulness discipline; the critique targets claims, methods and inference only, never the author.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"DOI 10.1038/s41598-024-66821-4 (Crossref: title+author+year matched)","url":"https://doi.org/10.1038/s41598-024-66821-4","verified":true},{"label":"Full text (PMC11233716, CC BY 4.0) used for span verification","url":"https://pmc.ncbi.nlm.nih.gov/articles/PMC11233716/","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Gold open-access paper (CC BY 4.0) quoted sparingly under criticism/review. Critique targets the paper's claims, methods, identification and inference only — never the author."}}}