{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000022","slug":"when-ai-judges-your-work","url":"https://policywindow.org/critique/c/when-ai-judges-your-work","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-28","current_version":"1.0","target_paper":{"title":"When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment","authors":["David Almog","Lucas Lippman","Daniel Martin"],"journal":"arXiv (working paper)","doi":"10.48550/arXiv.2603.02076","url":"https://arxiv.org/abs/2603.02076","publicationDate":"2026","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.48550/arXiv.2603.02076"},"source_journal":{"tier":"exception","rankingSources":["https://arxiv.org/abs/2603.02076","https://ar5iv.labs.arxiv.org/html/2603.02076"],"rankingNote":"An influential working paper (arXiv preprint, not peer-reviewed) critiqued at full text via ar5iv; disclosed off-list, tier 'exception' (preprint)."},"selection_provenance":{"id":"when-ai-judges-your-work","venue":"arXiv (working paper)","inMonitoredSet":false,"determinedTier":null,"recordedTier":"exception","effectiveTier":"exception","kind":"off_list","disclosed":true,"offListPeerReviewed":false},"selection":{"aiAgiCentralityScore":3,"societalRelevanceScore":4,"aiAgiCategories":["human_AI_interaction","labour_markets"],"selectionReason":"Test of /critical-ai-publish: OA full-text critique span-grounded to the ar5iv full text; engages identification + overclaim blind-spot dimensions."},"scores":{"aiAgiContribution":3,"evidentiarySupport":2,"methodologicalRisk":3,"overclaiming":3,"reproducibilityOrAuditability":3,"societalImpactRelevance":4,"severity":"moderate","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"A pre-registered, IRB-approved online experiment (N=208 participants, between-subjects, on Prolific) testing 
whether US workers change how they write image captions when told an AI versus a human will grade them. The design is genuinely strong in several respects: pre-registration, having BOTH ChatGPT and three PhD-student graders score EVERY caption regardless of treatment arm, fixed-rank (top-30%) incentives that neutralize beliefs about evaluator leniency, individual-clustered standard errors with image fixed effects, and a vivid cost comparison ($11.67 vs $6,480). The robust, clean finding is that AI assignment raises output quantity (caption length, +27.8% SD). The headline claim — that AI assessment lowers work QUALITY — is the fragile part, and it rests on two decisive problems. First, that conclusion is obtained only after 'controlling for output quantity,' but the treatment itself moves caption length (and pasting); conditioning on a treatment-moved mediator is a bad-control problem, so the length-adjusted contrast no longer estimates the total causal effect of evaluator assignment on quality. Second, the only quality measure with no model in the loop — raw human grades — runs the OPPOSITE way and is significant: captions written under AI assignment scored HIGHER (5.07 vs 4.97, p=0.0046). The 'lower quality' result is therefore not just control-dependent but contradicts the single model-free benchmark. The quantity finding and the cost comparison are credible; the quality headline is overstated.","claims":[{"id":"C1","text":"The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator. The paper documents that the treatment itself causally raises output q","type":"causal","evidenceOffered":"However, when controlling for output quantity, we find that the quality of captions declines when subjects are assigned to be judged by AI.","support":"weak","overclaiming":"moderate","assessment":"The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator. 
The paper documents that the treatment itself causally raises output quantity (caption length, +27.8% SD) — that is the companion headline finding — yet the quality conclusion is reached by 'controlling for output quantity.' Conditioning on a variable the treatment moves is a textbook bad-control problem: the length-adjusted within-quantity contrast no longer estimates the total causal effect of evaluator assignment on quality, and the authors adopt this length-holding-fixed estimand as the causal quantity of interest without arguing why it, rather than the total effect, is the right target. The total (unconditional) effect on the model-free human grade is in fact opposite-signed (see next flaw), which is exactly what a bad control can mask. Partial defense: Figure 7 shows the length-controlled sign is consistent for both graders, giving the adjusted claim internal coherence — but consistency across graders does not rescue the estimand choice.","mainWeakness":"The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator. The paper documents that the treatment itself causally raises output quantity (caption len","confidence":"high"},{"id":"C2","text":"The only quality measure with no model in the loop — raw human grades — directly contradicts the thesis and is statistically significant: captions written under AI (ChatGPT) assign","type":"causal","evidenceOffered":"captions evaluated in the ChatGPT evaluation treatment received a statistically significantly higher average grade (5.07 vs. 4.97, two-sided t-test, p-value = 0.0046). 
This result is driven by two of the three graders.","support":"moderate","overclaiming":"moderate","assessment":"The only quality measure with no model in the loop — raw human grades — directly contradicts the thesis and is statistically significant: captions written under AI (ChatGPT) assignment received a HIGHER average human grade (5.07 vs 4.97, p=0.0046), driven by two of three graders. The abstract's claim that quality is 'lower... regardless of whether quality is measured using humans or LLM grades' holds only after the length/pasting adjustments flagged in the prior flaw; the unconditional, model-free human result points the other way. This is not concealment — the abstract discloses the 'controlling for quantity' qualifier and the paper reports the divergent grades openly — but the title/abstract framing presents the adjusted result as the general finding while a significant opposite-signed raw result sits in Section 3.2. For the one benchmark with no LLM in the loop, the headline direction reverses.","mainWeakness":"The only quality measure with no model in the loop — raw human grades — directly contradicts the thesis and is statistically significant: captions written under AI (ChatGPT) assignment received a HIGH","confidence":"high"},{"id":"C3","text":"Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores: an extra 100 characters is worth ","type":"methodological","evidenceOffered":"an additional 100 characters increased human-assigned grades by an estimated 1 point, whereas the effect was 0.5 points for ChatGPT","support":"weak","overclaiming":"moderate","assessment":"Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores: an extra 100 characters is worth +1 point for human graders and +0.5 for ChatGPT on the 0-9 scale. 
Because the same underlying variable defines the quantity outcome and dominates the quality scores, the two constructs are mechanically entangled. The 'more quantity, less quality' narrative is therefore partly an artifact of length entering both measures — the treatment raises length, length mechanically raises raw grades, and the 'quality decline' only appears once length is partialled back out — rather than two independent constructs trading off. This is the same length variable doing triple duty (treatment-affected outcome, dominant grade driver, and regression control), which is why the headline is so sensitive to the length adjustment.","mainWeakness":"Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores: an extra 100 characters is worth +1 point for human g","confidence":"high"}],"sections":[{"id":"what","title":"What the paper does","body":"A pre-registered, IRB-approved online experiment (N=208, between-subjects, Prolific) testing whether US workers change how they write image captions when told an AI vs a human will grade them. Both ChatGPT and three PhD-student graders scored every caption. The robust finding is higher output quantity under AI; the contested finding is lower quality."},{"id":"identification","title":"Identification: the quality headline is a bad-control result","body":"The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator. The paper documents that the treatment itself causally raises output quantity (caption length, +27.8% SD) — that is the companion headline finding — yet the quality conclusion is reached by 'controlling for output quantity.' 
Conditioning on a variable the treatment moves is a textbook bad-control problem: the length-adjusted within-quantity contrast no longer estimates the total causal effect of evaluator assignment on quality, and the authors adopt this length-holding-fixed estimand as the causal quantity of interest without arguing why it, rather than the total effect, is the right target. The total (unconditional) effect on the model-free human grade is in fact opposite-signed (see next flaw), which is exactly what a bad control can mask. Partial defense: Figure 7 shows the length-controlled sign is consistent for both graders, giving the adjusted claim internal coherence — but consistency across graders does not rescue the estimand choice."},{"id":"overclaim","title":"The model-free measure runs the other way","body":"The only quality measure with no model in the loop — raw human grades — directly contradicts the thesis and is statistically significant: captions written under AI (ChatGPT) assignment received a HIGHER average human grade (5.07 vs 4.97, p=0.0046), driven by two of three graders. The abstract's claim that quality is 'lower... regardless of whether quality is measured using humans or LLM grades' holds only after the length/pasting adjustments flagged in the prior flaw; the unconditional, model-free human result points the other way. This is not concealment — the abstract discloses the 'controlling for quantity' qualifier and the paper reports the divergent grades openly — but the title/abstract framing presents the adjusted result as the general finding while a significant opposite-signed raw result sits in Section 3.2. 
For the one benchmark with no LLM in the loop, the headline direction reverses."},{"id":"measurement","title":"Length doing triple duty","body":"Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores: an extra 100 characters is worth +1 point for human graders and +0.5 for ChatGPT on the 0-9 scale. Because the same underlying variable defines the quantity outcome and dominates the quality scores, the two constructs are mechanically entangled. The 'more quantity, less quality' narrative is therefore partly an artifact of length entering both measures — the treatment raises length, length mechanically raises raw grades, and the 'quality decline' only appears once length is partialled back out — rather than two independent constructs trading off. This is the same length variable doing triple duty (treatment-affected outcome, dominant grade driver, and regression control), which is why the headline is so sensitive to the length adjustment."},{"id":"strengths","title":"What the paper does well","body":"This is a genuinely well-executed experiment and several criticisms have real counterarguments. It is pre-registered (AsPredicted) and IRB-approved, with clean between-subjects randomization that neutralizes selection and ordering confounds. The decision to have BOTH ChatGPT and three PhD-student graders score EVERY caption regardless of treatment arm is exactly the right design for separating evaluator-source effects from treatment effects — and it is precisely this design that SURFACED the human/AI grade divergence, which the authors then report openly rather than bury. The fixed-rank (top-30%) incentive rules out the obvious confound of differential leniency beliefs. Main-effect inference is properly clustered at the individual level with image fixed effects, so the headline tests are not statistically naive. 
On the bad-control point, the authors plausibly care about a quality-holding-quantity-fixed contrast as a substantive question (do workers trade quantity for quality under AI?), and Figure 7 shows the length-controlled result is consistent for BOTH graders, giving the adjusted claim more internal coherence than the raw numbers alone suggest. The ChatGPT-grade result is reported as robust to temperature and to joint-vs-separate-criterion prompting (footnote 15), blunting reproducibility worries. The cost comparison ($11.67 vs $6,480) is a real, vividly documented contribution about adoption incentives. The quantity result is clean and credible. Much of the remaining dispute is about estimand choice and framing rather than analytic error."}],"strongest_critique":"The central claim — that telling workers an AI will judge them lowers the quality of their work — is not robustly supported by the paper's own data. The only quality benchmark with no model in the loop, raw human grades, goes the OPPOSITE way and is significant: captions produced under AI assignment scored HIGHER (5.07 vs 4.97, p=0.0046). The 'lower quality under AI' conclusion materializes only after 'controlling for output quantity' — but the treatment itself moves quantity (a +27.8% SD length effect is the companion headline). Conditioning on a treatment-moved mediator is a classic bad-control problem: the length-adjusted contrast no longer estimates the total causal effect of evaluator assignment on quality, and the authors never argue why a quality-holding-quantity-fixed comparison, rather than the total effect, is the estimand of interest. The mechanism is mechanical: length is simultaneously the 'quantity' outcome and the dominant driver of the grades (+1 point per 100 characters for humans), so the 'more quantity, less quality' story is partly an artifact of one variable doing triple duty as outcome, grade-driver, and control. 
A control-dependent headline that reverses on the single model-free measure is the principal weakness.","strongest_fair_defence":"This is a genuinely well-executed experiment and several criticisms have real counterarguments. It is pre-registered (AsPredicted) and IRB-approved, with clean between-subjects randomization that neutralizes selection and ordering confounds. The decision to have BOTH ChatGPT and three PhD-student graders score EVERY caption regardless of treatment arm is exactly the right design for separating evaluator-source effects from treatment effects — and it is precisely this design that SURFACED the human/AI grade divergence, which the authors then report openly rather than bury. The fixed-rank (top-30%) incentive rules out the obvious confound of differential leniency beliefs. Main-effect inference is properly clustered at the individual level with image fixed effects, so the headline tests are not statistically naive. On the bad-control point, the authors plausibly care about a quality-holding-quantity-fixed contrast as a substantive question (do workers trade quantity for quality under AI?), and Figure 7 shows the length-controlled result is consistent for BOTH graders, giving the adjusted claim more internal coherence than the raw numbers alone suggest. The ChatGPT-grade result is reported as robust to temperature and to joint-vs-separate-criterion prompting (footnote 15), blunting reproducibility worries. The cost comparison ($11.67 vs $6,480) is a real, vividly documented contribution about adoption incentives. The quantity result is clean and credible. 
Much of the remaining dispute is about estimand choice and framing rather than analytic error.","final_judgment":"A competent, transparent, pre-registered experiment whose causal infrastructure (randomization, dual grading of every caption, leniency-neutralizing incentives, individual-clustered SEs with image fixed effects) is solid, and whose quantity result and cost-of-grading comparison are credible. The headline 'AI assessment lowers work quality' claim, however, is overstated. It is fragile in two decisive and tightly linked ways: it is obtained only by conditioning on output quantity, a mediator the treatment itself moves (a bad-control problem that makes the length-adjusted contrast something other than the total causal effect), and it reverses on the single model-free benchmark — raw human grades favor the AI treatment, significantly (5.07 vs 4.97, p=0.0046). The entanglement is mechanical: length is simultaneously the quantity outcome and the dominant driver of and control for the grades, so the 'more quantity, less quality' narrative is partly an artifact of how length enters both measures. These are flaws of estimand choice, identification, and emphasis rather than fabrication or gross analytic malpractice, and the authors' own robustness checks and open reporting of the divergent human grades cut against the harshest reading — hence overall moderate severity. 
But the two high-severity points are decisive enough that the quality headline should be read as control-dependent, not as a general property of AI assessment.","review_process":{"aiAgentsUsed":["claim_extraction","methods","statistics","identification","overclaiming","adversarial","author_defence","plain_language","meta_review"],"reviewRounds":2,"humanEditor":{"name":"","role":"","approvalDate":"2026-06-28","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited","editorialActionAfterResponse":"Authors may reply at any time; this critique addresses claims, methods and inference only, never the authors."},"versions":[{"version":"1.0","date":"2026-06-28","note":"Initial publication (test of the delegated /critical-ai-publish --promote path).","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Full-text critique of an open arXiv working paper; every span verified an exact substring of the ar5iv full text (content-addressed source store), independently re-checked; DOI resolves via DataCite (title+authors+year matched). Convergence gate (refute+defender+neutral) returned a survives-majority. Targets claims/methods/inference only.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"DOI 10.48550/arXiv.2603.02076 (DataCite: title+authors+year matched)","url":"https://doi.org/10.48550/arXiv.2603.02076","verified":true},{"label":"Full text (ar5iv) used for span verification","url":"https://ar5iv.labs.arxiv.org/html/2603.02076","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Open preprint quoted sparingly under criticism/review; critique targets claims, methods and inference only — never the authors."}}}