Comment on "When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment"

Item: When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment
Author: Critical AI

Post-publication Comment · Critical AI

Critical AI · published 2026-06-28 · v1.0 · CRIT-000022

Concerning: David Almog, Lucas Lippman, Daniel Martin · arXiv (working paper) · 2026

Severity: Moderate · Confidence: High · Tier: exception · Preprint (not peer-reviewed) · Open-access full text · Empirical

Human–AI interaction · Labour markets

Why this paper was selected

Test of /critical-ai-publish: OA full-text critique span-grounded to the ar5iv full text; engages identification + overclaim blind-spot dimensions.

AI/AGI centrality 3/5 · societal relevance 4/5 · source-journal note: An influential working paper (arXiv preprint, not peer-reviewed) critiqued at full text via ar5iv; disclosed off-list, tier 'exception' (preprint).

Summary

A pre-registered, IRB-approved online experiment (N=208 participants, between-subjects, on Prolific) testing whether US workers change how they write image captions when told that an AI, rather than a human, will grade them. The design is genuinely strong in several respects: pre-registration; having BOTH ChatGPT and three PhD-student graders score EVERY caption regardless of treatment arm; fixed-rank (top-30%) incentives that neutralize beliefs about evaluator leniency; and individual-clustered standard errors with image fixed effects. The paper also contributes a vivid cost comparison ($11.67 vs $6,480). The robust, clean finding is that AI assignment raises output quantity (caption length, +27.8% SD). The headline claim, that AI assessment lowers work QUALITY, is the fragile part, and it has two decisive problems. First, that conclusion is obtained only after 'controlling for output quantity,' but the treatment itself moves caption length (and pasting); conditioning on a treatment-moved mediator is a bad-control problem, so the length-adjusted contrast no longer estimates the total causal effect of evaluator assignment on quality. Second, the only quality measure with no model in the loop, raw human grades, runs the OPPOSITE way and is significant: captions written under AI assignment scored HIGHER (5.07 vs 4.97, p=0.0046). The 'lower quality' result is therefore not just control-dependent but also contradicted by the single model-free benchmark. The quantity finding and the cost comparison are credible; the quality headline is overstated.

Central claims & evidence map

Claim	Type	Evidence offered	Support	Overclaiming	Main weakness
C1. The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator	Causal	However, when controlling for output quantity, we find that the quality of captions declines when subjects are assigned to be judged by AI.	Weak	Moderate	The treatment itself causally raises output quantity (caption length, +27.8% SD), so the length-adjusted contrast no longer estimates the total causal effect of evaluator assignment on quality.
C2. The only quality measure with no model in the loop, raw human grades, directly contradicts the thesis and is statistically significant	Causal	captions evaluated in the ChatGPT evaluation treatment received a statistically significantly higher average grade (5.07 vs. 4.97, two-sided t-test, p-value = 0.0046). This result is driven by two of the three graders.	Moderate	Moderate	Captions written under AI (ChatGPT) assignment received a HIGHER average human grade, so the headline direction reverses on the one model-free benchmark.
C3. Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores	Methodological	an additional 100 characters increased human-assigned grades by an estimated 1 point, whereas the effect was 0.5 points for ChatGPT	Weak	Moderate	An extra 100 characters is worth +1 point for human graders and +0.5 for ChatGPT on the 0-9 scale, so the quantity and quality constructs are mechanically entangled.

Per-claim assessment

C1. Identification: the quality headline is a bad-control result
The headline 'lower quality under AI' result is identified only by conditioning on a post-treatment mediator. The paper documents that the treatment itself causally raises output quantity (caption length, +27.8% SD) — that is the companion headline finding — yet the quality conclusion is reached by 'controlling for output quantity.' Conditioning on a variable the treatment moves is a textbook bad-control problem: the length-adjusted within-quantity contrast no longer estimates the total causal effect of evaluator assignment on quality, and the authors adopt this length-holding-fixed estimand as the causal quantity of interest without arguing why it, rather than the total effect, is the right target. The total (unconditional) effect on the model-free human grade is in fact opposite-signed (see next flaw), which is exactly what a bad control can mask. Partial defense: Figure 7 shows the length-controlled sign is consistent for both graders, giving the adjusted claim internal coherence — but consistency across graders does not rescue the estimand choice.
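The bad-control mechanism can be made concrete with a small simulation. This is a minimal sketch on synthetic data, not a re-analysis of the paper: the latent `effort` variable, every coefficient, and the sample size are assumptions chosen for illustration, and the `ols` helper is a hand-rolled solver rather than anything the authors used. By construction the treatment has no direct effect on the grade; it only raises length, and an unobserved trait drives both length and grade.

```python
import random

def ols(columns, y):
    """OLS coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda row: abs(a[row][i]))
        a[i], a[p] = a[p], a[i]
        pivot = a[i][i]
        a[i] = [v / pivot for v in a[i]]
        for r in range(k):
            if r != i:
                f = a[r][i]
                a[r] = [x - f * z for x, z in zip(a[r], a[i])]
    return [a[i][k] for i in range(k)]

def simulate(n=20000, seed=0):
    """Synthetic captions; all numbers below are illustrative assumptions."""
    rng = random.Random(seed)
    treat, length, grade = [], [], []
    for _ in range(n):
        t = 1.0 if rng.random() < 0.5 else 0.0  # randomized "AI grader" arm
        effort = rng.gauss(0, 1)                # hypothetical, unobserved
        chars = 229 + 22 * t + 60 * effort + rng.gauss(0, 40)
        # Direct effect of t on the grade is zero by construction.
        g = 2.7 + 0.01 * chars + 0.5 * effort + rng.gauss(0, 1)
        treat.append(t)
        length.append(chars)
        grade.append(g)
    return treat, length, grade

treat, length, grade = simulate()
total_effect = ols([treat], grade)[1]             # unconditional contrast
adjusted_effect = ols([treat, length], grade)[1]  # "controlling for length"
print(f"raw difference in mean grades:    {total_effect:+.3f}")
print(f"coefficient holding length fixed: {adjusted_effect:+.3f}")
```

In this sketch the raw contrast is positive while the length-adjusted coefficient is negative, even though the simulated treatment never lowers quality directly. Holding length fixed compares treated workers with less of the latent trait against control workers with more of it, which is why the adjusted number need not equal either the total or the direct effect.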
C2. The model-free measure runs the other way
The only quality measure with no model in the loop — raw human grades — directly contradicts the thesis and is statistically significant: captions written under AI (ChatGPT) assignment received a HIGHER average human grade (5.07 vs 4.97, p=0.0046), driven by two of three graders. The abstract's claim that quality is 'lower... regardless of whether quality is measured using humans or LLM grades' holds only after the length/pasting adjustments flagged in the prior flaw; the unconditional, model-free human result points the other way. This is not concealment — the abstract discloses the 'controlling for quantity' qualifier and the paper reports the divergent grades openly — but the title/abstract framing presents the adjusted result as the general finding while a significant opposite-signed raw result sits in Section 3.2. For the one benchmark with no LLM in the loop, the headline direction reverses.
C3. Length doing triple duty
Caption length is used simultaneously as the 'quantity' outcome AND as the single largest determinant of (and the control for) the quality scores: an extra 100 characters is worth +1 point for human graders and +0.5 for ChatGPT on the 0-9 scale. Because the same underlying variable defines the quantity outcome and dominates the quality scores, the two constructs are mechanically entangled. The 'more quantity, less quality' narrative is therefore partly an artifact of length entering both measures — the treatment raises length, length mechanically raises raw grades, and the 'quality decline' only appears once length is partialled back out — rather than two independent constructs trading off. This is the same length variable doing triple duty (treatment-affected outcome, dominant grade driver, and regression control), which is why the headline is so sensitive to the length adjustment.
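The entanglement can be written as a back-of-envelope decomposition. This is a sketch under an assumed linear model, not the paper's specification: $G$ is the human grade, $T$ the AI-assignment indicator, $L$ the caption length in characters, and $\alpha$, $\beta$, $\tau_{\text{direct}}$ and $\varepsilon$ are symbols introduced here for illustration. It plugs in figures quoted in this Comment (lengths of 251 vs 229 characters, +1 human grade point per 100 characters, raw human grades of 5.07 vs 4.97).

```latex
\begin{align*}
G &= \alpha + \tau_{\text{direct}}\,T + \beta\,L + \varepsilon,
  \qquad E[\varepsilon \mid T] = 0
  && \text{(assumed linear model)} \\
\underbrace{E[G \mid T=1] - E[G \mid T=0]}_{\text{total effect}}
  &= \tau_{\text{direct}}
   + \beta\,\bigl(E[L \mid T=1] - E[L \mid T=0]\bigr) \\
5.07 - 4.97 &\approx \tau_{\text{direct}} + 0.01 \times (251 - 229) \\
\tau_{\text{direct}} &\approx 0.10 - 0.22 = -0.12
\end{align*}
```

Under these assumptions a positive raw difference and a negative length-held-fixed difference are two readings of the same data, separated only by the length term. The implied $-0.12$ is illustrative arithmetic; the paper's own adjusted estimate, which also involves image fixed effects and a pasting adjustment, may differ.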

Scorecard

AI/AGI contribution: 3.0 / 5

Evidentiary support: 2.0 / 5

Methodological risk: 3.0 / 5

Overclaiming: 3.0 / 5

Reproducibility / auditability: 3.0 / 5

Societal-impact relevance: 4.0 / 5

Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.

What the paper does

A pre-registered, IRB-approved online experiment (N=208, between-subjects, Prolific) testing whether US workers change how they write image captions when told an AI vs a human will grade them. Both ChatGPT and three PhD-student graders scored every caption. The robust finding is higher output quantity under AI; the contested finding is lower quality.

What the paper does well

This is a genuinely well-executed experiment and several criticisms have real counterarguments. It is pre-registered (AsPredicted) and IRB-approved, with clean between-subjects randomization that neutralizes selection and ordering confounds. The decision to have BOTH ChatGPT and three PhD-student graders score EVERY caption regardless of treatment arm is exactly the right design for separating evaluator-source effects from treatment effects — and it is precisely this design that SURFACED the human/AI grade divergence, which the authors then report openly rather than bury. The fixed-rank (top-30%) incentive rules out the obvious confound of differential leniency beliefs. Main-effect inference is properly clustered at the individual level with image fixed effects, so the headline tests are not statistically naive. On the bad-control point, the authors plausibly care about a quality-holding-quantity-fixed contrast as a substantive question (do workers trade quantity for quality under AI?), and Figure 7 shows the length-controlled result is consistent for BOTH graders, giving the adjusted claim more internal coherence than the raw numbers alone suggest. The ChatGPT-grade result is reported as robust to temperature and to joint-vs-separate-criterion prompting (footnote 15), blunting reproducibility worries. The cost comparison ($11.67 vs $6,480) is a real, vividly documented contribution about adoption incentives. The quantity result is clean and credible. Much of the remaining dispute is about estimand choice and framing rather than analytic error.
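The point about the fixed-rank incentive can be sketched in a few lines. This is a minimal illustration, assuming that a grader's leniency acts as a strictly increasing rescaling of grades; the worker ids, the grades, and the `bonus_winners` helper are hypothetical and not taken from the paper.

```python
def bonus_winners(grades, share=0.30):
    """Ids of the workers whose grade falls in the top `share` of the pool."""
    n_winners = round(len(grades) * share)
    ranked = sorted(grades, key=grades.get, reverse=True)
    return set(ranked[:n_winners])

# Hypothetical grades for ten workers on a 0-9 scale (illustrative only).
baseline = {"w01": 2.0, "w02": 6.5, "w03": 4.0, "w04": 7.5, "w05": 5.0,
            "w06": 3.5, "w07": 8.0, "w08": 4.5, "w09": 5.5, "w10": 1.0}

# Two imagined graders: one uniformly lenient, one uniformly harsh.
lenient = {w: g + 1.0 for w, g in baseline.items()}
harsh = {w: 0.5 * g for w, g in baseline.items()}

winners = bonus_winners(baseline)
same_under_lenient = bonus_winners(lenient) == winners
same_under_harsh = bonus_winners(harsh) == winners
print(sorted(winners), same_under_lenient, same_under_harsh)
# prints: ['w02', 'w04', 'w07'] True True
```

Because the bonus depends only on rank, a belief that one evaluator grades higher or lower across the board does not change who expects to win. The invariance does not extend to a grader who reorders captions, for example by rewarding length at a different rate.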

Strongest critique

The central claim — that telling workers an AI will judge them lowers the quality of their work — is not robustly supported by the paper's own data. The only quality benchmark with no model in the loop, raw human grades, goes the OPPOSITE way and is significant: captions produced under AI assignment scored HIGHER (5.07 vs 4.97, p=0.0046). The 'lower quality under AI' conclusion materializes only after 'controlling for output quantity' — but the treatment itself moves quantity (a +27.8% SD length effect is the companion headline). Conditioning on a treatment-moved mediator is a classic bad-control problem: the length-adjusted contrast no longer estimates the total causal effect of evaluator assignment on quality, and the authors never argue why a quality-holding-quantity-fixed comparison, rather than the total effect, is the estimand of interest. The mechanism is mechanical: length is simultaneously the 'quantity' outcome and the dominant driver of the grades (+1 point per 100 characters for humans), so the 'more quantity, less quality' story is partly an artifact of one variable doing triple duty as outcome, grade-driver, and control. A control-dependent headline that reverses on the single model-free measure is the principal weakness.

Conclusion

A competent, transparent, pre-registered experiment whose causal infrastructure (randomization, dual grading of every caption, leniency-neutralizing incentives, individual-clustered SEs with image fixed effects) is solid, and whose quantity result and cost-of-grading comparison are credible. The headline 'AI assessment lowers work quality' claim, however, is overstated. It is fragile in two decisive and tightly linked ways: it is obtained only by conditioning on output quantity, a mediator the treatment itself moves (a bad-control problem that makes the length-adjusted contrast something other than the total causal effect), and it reverses on the single model-free benchmark — raw human grades favor the AI treatment, significantly (5.07 vs 4.97, p=0.0046). The entanglement is mechanical: length is simultaneously the quantity outcome and the dominant driver of and control for the grades, so the 'more quantity, less quality' narrative is partly an artifact of how length enters both measures. These are flaws of estimand choice, identification, and emphasis rather than fabrication or gross analytic malpractice, and the authors' own robustness checks and open reporting of the divergent human grades cut against the harshest reading — hence overall moderate severity. But the two high-severity points are decisive enough that the quality headline should be read as control-dependent, not as a general property of AI assessment.

Reply from the authors

Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.

Reply: not yet invited. No reply has been received for publication.

The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.

Automated re-evaluation after reply: Authors may reply at any time; this critique addresses claims, methods and inference only, never the authors.

References

Every external source this Comment cites, each with a verified link. 0 fabricated.

Source-grounding attestation

✓ attested in-app · grounding: spans in app

✓ Verbatim source spans present in the critique: 3/3 provenance spans re-derived in the critique prose
✓ Passes the publication validator: no errors
✓ Zero fabricated citations: 0 fabricated
✓ Severity within the access-basis cap: severity "moderate" ≤ cap "high" for open_access

Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).

Re-verify span-in-source offline: python3 scripts/verify-queue-critiques.py

Independent faithfulness review

A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.

✓ Faithful · 0/2 reviewers sustained a concern · source retrieved

The strongest critique survives against the text. Both load-bearing prongs are verified verbatim and are logically sound, and the defender's best rebuttal actually instantiates the critique rather than refuting it.

PRONG 1: bad control (verified). The headline is explicitly conditional: "However, when controlling for output quantity, we find that the quality of captions declines when subjects are assigned to be judged by AI" (line 16; abstract line 8). The treatment itself causally moves the conditioning variable: length 251 vs 229 chars, p<0.0001 (line 45), the +27.8%-SD quantity effect. Conditioning on a treatment-moved mediator means the length-adjusted contrast is not the total causal effect of evaluator assignment on quality. The authors never argue why a quality-holding-quantity-fixed estimand, rather than the total effect, is the target of interest.

PRONG 2: opposite-signed model-free benchmark (verified). The single quality measure with no model in the loop, raw human grades, runs the other way and is significant: "captions evaluated in the ChatGPT evaluation treatment received a statistically significantly higher average grade (5.07 vs. 4.97, two-sided t-test, p-value

Version & correction history

Version	Date	Change
v1.0	2026-06-28	Initial publication (test of the delegated /critical-ai-publish --promote path).

No silent substantive corrections — every change is versioned and visible.

How to cite this Comment

Critical AI. Comment on “When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment” (David Almog et al., arXiv (working paper), 2026). Critical AI; 2026. https://policywindow.org/critique/c/when-ai-judges-your-work

A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.

Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/when-ai-judges-your-work/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique when-ai-judges-your-work --live.

Content fingerprint f2a8679183ecc719 (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.