Post-publication Comment · Critical AI
Comment on “Fairness Is More Than Algorithms: Racial Disparities in Time-to-Recidivism”
Critical AI · published 2026-06-28 · v1.0 · CRIT-000026
Concerning: Jessy Xinyi Han, Kristjan Greenewald, Devavrat Shah · arXiv (cs.CY; stat.AP) preprint · 2025
Why this paper was selected
Cross-domain generality proof (public_policy white-space): a full-text critique of an AI-in-criminal-justice algorithmic-fairness audit, span-grounded to the gold-OA arXiv full text via the source store.
AI/AGI centrality 3/5 · societal relevance 5/5 · source-journal note: arXiv preprint (cs.CY; stat.AP), not peer-reviewed; posted under CC BY 4.0. Included as an influential algorithmic-fairness / criminal-justice contribution (a COMPAS audit reframing recidivism as time-to-event); tier 'exception' (preprint). Critiqued at full text via the source store.
Summary
This arXiv preprint reframes recidivism from a yes/no outcome to a time-to-event (survival) process and offers a survival-based falsification test: if, within a COMPAS risk band, the timing of re-arrest differs by race, that supports rejecting the null that non-algorithmic (socioeconomic) factors do not affect recidivism. Applied to the ProPublica COMPAS data (~10,000 Broward County defendants), the headline is that medium- and high-risk strata show no racial difference, but among low-risk defendants a disparity emerges after roughly seven months, read as evidence that structural factors accumulate over time. The conceptual move is genuine, and the paper is candid in its limitations (jurisdiction, missing socioeconomic variables) and consistently hedged ('possibly', 'may accumulate'). Three cautions remain. First, the headline rests on a single low-risk cell crossing significance, drawn from a battery of log-rank tests across two races, three risk bands, two risk-score outcomes and the whole follow-up window, with the other cells null and with no multiple-comparison correction, hazard ratios, confidence bands or subgroup sample sizes reported, so it reads as weakly supported and the seven-month threshold looks data-selected. Second, on identification, the paper does engage COMPAS's documented race-differential miscalibration and states its exclusion restriction openly, so the issue is narrower than 'a hidden assumption': rejecting the null shows some non-algorithmic factor matters but cannot separate residual algorithmic miscalibration from socioeconomic context, so the specific attribution to housing/employment/support outruns the test. Third, treating non-criminal returns to custody as non-informative censoring is asserted, not defended; those returns are plausibly correlated with the same socioeconomic instability and with race, and informative differential censoring over a long window can itself manufacture the late low-risk disparity. 
(One span was re-anchored to a clean substring after an arXiv-HTML LaTeX artifact; the identification flaw was softened from high to moderate after a defender confirmed the paper's explicit calibration discussion; a reproducibility item was dropped as non-groundable and double-counted.)
Central claims & evidence map
| Claim | Type | Evidence offered | Support | Overclaiming | Main weakness |
|---|---|---|---|---|---|
| The headline that racial disparities significantly emerge among low-risk defendants rests on a single uncorrected subgroup log-rank result, reported without effect sizes or precision. | Methodological | significant disparities begin to appear with longer periods approximately seven months of follow-up | Weak | Major | A single uncorrected subgroup cell, without effect sizes, precision, or multiplicity control, cannot bear the headline; the seven-month threshold appears data-selected. |
| Rejecting the no-non-algorithmic-effect null shows some non-algorithmic factor matters, but the test cannot separate residual algorithmic miscalibration from socioeconomic context. | Causal | rejecting the null hypothesis that non-algorithmic factors (including socioeconomic ones) do not affect recidivism | Moderate | Moderate | The test licenses 'some non-algorithmic factor matters', not the specific attribution to socioeconomic structures; residual algorithmic miscalibration is an unseparated rival. |
| Treating non-criminal returns to custody as censoring events requires non-informative censoring, which is stated as a formal precondition but never tested or defended. | Methodological | return to custody for non-criminal violations (treated as censoring events in our analysis) | Weak | Moderate | An undefended non-informative-censoring assumption on plausibly SES- and race-correlated custody returns; if violated, it reproduces the headline pattern. |
Per-claim assessment
C1. The headline that racial disparities significantly emerge among low-risk defendants rests on a single uncorrected subgroup log-rank result, reported without effect sizes or precision.
The significant low-risk cell is one of many: log-rank tests compare the two racial groups within three risk strata (low 1-4, medium 5-7, high 8-10), for two risk-score outcomes (risk of recidivism and of violent recidivism), and along the full follow-up window, with the medium- and high-risk cells explicitly null (p>0.1). No multiple-comparison correction, hazard ratios, confidence intervals or bands, or per-subgroup sample sizes are reported, and the 'approximately seven months' threshold reads as identified from the data (read off where a time-varying p-value curve first crosses 0.05) rather than pre-specified. Under this much uncorrected multiplicity, a single low-risk cell crossing 0.05 would be unsurprising even if no true disparity existed, so 'disparities significantly emerge' is weakly supported as presented.
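To make the multiplicity concern concrete, here is a minimal sketch; it is not the paper's analysis. It assumes six independent tests (three risk bands times two risk-score outcomes), which is only one way of counting the cells, and the p-values passed to `holm_adjust` are hypothetical placeholders, not values reported in the preprint.

```python
def familywise_error(num_tests, alpha=0.05):
    """P(at least one false positive) over independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# ASSUMPTION: 3 risk bands x 2 risk-score outcomes = 6 tests, treated as independent.
NUM_TESTS = 6
fwer = familywise_error(NUM_TESTS)

# HYPOTHETICAL p-values for illustration only; the first stands in for a
# nominally significant low-risk cell, the rest for null cells.
hypothetical_p = [0.04, 0.30, 0.50, 0.20, 0.60, 0.70]
adjusted_p = holm_adjust(hypothetical_p)

print(f"family-wise error rate over {NUM_TESTS} tests: {fwer:.3f}")
print("Holm-adjusted p-values:", [round(p, 2) for p in adjusted_p])
```

Under these assumptions the chance of at least one nominally significant cell is about 26% when every null is true, and a raw p-value of 0.04 adjusts to 0.24. Reading a threshold off a time-varying p-value curve adds further looks at the data that this count does not include.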
C2. Rejecting the no-non-algorithmic-effect null shows some non-algorithmic factor matters, but the test cannot separate residual algorithmic miscalibration from socioeconomic context.
Credit where due: the paper openly states this null and engages COMPAS's documented race-differential miscalibration (invoking the Chouldechova impossibility result), so the exclusion restriction is deliberately tested, not smuggled in. The remaining, narrower problem is identifying the mechanism: residual within-bin algorithmic miscalibration and unobserved socioeconomic factors sit in the same unobserved node, so rejecting the null cannot distinguish them. The escalation from the disjunctive null to specifically 'housing, employment, and social support' therefore outruns what the test identifies — although the paper hedges this with 'possibly' and 'may accumulate', which lowers the charge from a failed claim to bounded interpretive over-reach.
C3. Treating non-criminal returns to custody as censoring events requires non-informative censoring, which is stated as a formal precondition but never tested or defended.
Survival/log-rank inference requires censoring to be independent of the event process; the paper states this as a formal assumption but never tests or probes its sensitivity. Technical/parole returns to custody are plausibly correlated with the same socioeconomic instability (housing, employment) the paper invokes — and with race — making the censoring informative and differential. Over a longer follow-up window, informative differential censoring can itself manufacture the late-emerging low-risk disparity, leaving it as an unrefuted rival for precisely the signal attributed to structural factors. Relatedly, using rearrest as the recidivism event carries race-correlated measurement error from differential policing.
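The censoring rival can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the COMPAS data: the latent "unstable" flag, the hazard rates, and the group labels are all hypothetical. Both groups share exactly the same re-arrest process; only group B has censoring that depends on the latent instability.

```python
import random


def simulate_group(n, informative_censoring, rng):
    """Return (times, event_flags); the event process is identical in every group."""
    times, events = [], []
    for _ in range(n):
        unstable = rng.random() < 0.5              # hypothetical latent instability
        event_rate = 0.10 if unstable else 0.025   # assumed monthly re-arrest hazard
        t_event = rng.expovariate(event_rate)
        if informative_censoring and unstable:
            # Hypothetical non-criminal return to custody, tied to instability.
            t_censor = rng.expovariate(0.20)
        else:
            t_censor = float("inf")
        times.append(min(t_event, t_censor))
        events.append(t_event <= t_censor)
    return times, events


def kaplan_meier_at(times, events, horizon):
    """Kaplan-Meier survival estimate at `horizon` (assumes no tied times)."""
    at_risk = len(times)
    survival = 1.0
    for t, is_event in sorted(zip(times, events)):
        if t > horizon:
            break
        if is_event:
            survival *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return survival


rng = random.Random(0)
times_a, events_a = simulate_group(5000, False, rng)  # no censoring
times_b, events_b = simulate_group(5000, True, rng)   # informative censoring
km_a = kaplan_meier_at(times_a, events_a, 24.0)
km_b = kaplan_meier_at(times_b, events_b, 24.0)
print(f"24-month KM survival, group A: {km_a:.3f}")
print(f"24-month KM survival, group B: {km_b:.3f}")
```

By construction the true 24-month survival is about 0.32 in both groups, yet group B's estimate lands well above it because high-hazard individuals leave the risk set before their events are observed. The direction and size of the artifact depend entirely on the assumed correlation between censoring and hazard; the point is only that a between-group gap can appear with no difference in the underlying event process.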
Scorecard
Sub-scores are 0–5 editorial judgements on fixed scales (higher is better, except methodological risk and overclaiming where higher is worse). They are contestable and open to a severity challenge from authors.
What the paper does
Reframing recidivism from a binary outcome to a time-to-event (survival) process, the paper proposes a causal/survival framework and a falsification test: within a COMPAS risk band, a difference in re-arrest timing by race supports rejecting the null that non-algorithmic (socioeconomic) factors do not affect recidivism. Applied to the ProPublica COMPAS data (~10,000 Broward County defendants), the medium/high-risk strata show no racial difference while a low-risk disparity emerges after roughly seven months, read as accumulating structural disadvantage.
Statistical Inference
The significant low-risk cell is one of many: log-rank tests compare the two racial groups within three risk strata, for two risk-score outcomes, and along the full follow-up window, with the medium/high cells explicitly null (p>0.1). No multiple-comparison correction, hazard ratios, confidence intervals/bands, or per-subgroup sample sizes are reported, and the 'approximately seven months' threshold reads as data-identified rather than pre-specified. Under this much uncorrected multiplicity, a single low-risk cell crossing 0.05 would be unsurprising even if no true disparity existed, so the headline that disparities significantly emerge is weakly supported as presented.
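As an illustration of the reporting asked for here, the sketch below computes a two-group log-rank test together with an effect size and interval. It is not the paper's method: the hazard ratio uses the Peto one-step approximation exp((O - E) / V), swapped in for brevity where a Cox model would be the more usual choice, the interval relies on a normal approximation, and the toy data are hypothetical.

```python
import math


def logrank_with_hr(times_a, events_a, times_b, events_b):
    """Two-group log-rank test plus Peto one-step hazard ratio (A relative to B)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in A just before t
        n_b = sum(1 for x in times_b if x >= t)  # at risk in B just before t
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    log_hr = obs_minus_exp / variance
    half_width = 1.96 / math.sqrt(variance)
    return {
        "n_a": len(times_a),
        "n_b": len(times_b),
        "chi2": chi2,
        "p_value": p_value,
        "hazard_ratio": math.exp(log_hr),
        "ci_95": (math.exp(log_hr - half_width), math.exp(log_hr + half_width)),
    }


# HYPOTHETICAL toy data: months to re-arrest (True) or censoring (False).
toy_a = ([3, 5, 8, 12, 14, 20, 24, 24], [True, True, False, True, True, False, False, False])
toy_b = ([6, 9, 15, 18, 24, 24, 24, 24], [True, False, True, False, False, False, False, False])
report = logrank_with_hr(toy_a[0], toy_a[1], toy_b[0], toy_b[1])
for key, value in report.items():
    print(key, value)
```

Reporting the subgroup sizes, the hazard ratio, and its interval alongside the p-value lets a reader judge whether a nominally significant cell reflects a sizeable, precisely estimated difference or a small one near the threshold.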
Identification (mechanism, not a hidden assumption)
On the merits the paper does better here than a first pass suggests: it states its exclusion restriction openly (it IS the tested null) and explicitly engages COMPAS's documented race-differential miscalibration. The narrower, surviving issue is that rejecting the null shows some non-algorithmic factor matters but cannot separate residual within-bin algorithmic miscalibration from socioeconomic context — both occupy the same unobserved node. The jump from the disjunctive null to specifically 'housing, employment, and social support' therefore outruns what the test identifies, though the paper's hedging ('possibly', 'may accumulate') keeps this to bounded interpretive over-reach.
Measurement / Informative Censoring
Log-rank inference requires non-informative censoring; the paper asserts this as a formal precondition for non-criminal returns to custody but never tests it. Such technical/parole returns are plausibly SES- and race-correlated, so the censoring is plausibly informative and differential — and over a long follow-up window that alone can manufacture the late low-risk disparity, an unrefuted rival for the very signal attributed to structural factors. Rearrest as the outcome adds race-correlated measurement error from differential policing.
What the paper does well
The conceptual contribution is genuine: recasting recidivism as time-to-event can surface divergences that single-time-point binary comparisons miss, and most algorithmic-fairness work ignores temporal dynamics. The framing is explicitly a falsification test rather than a positive identification of mechanisms, the wording is consistently hedged ('possibly', 'may accumulate', 'suggests'), and the paper reports the null medium/high cells rather than only the favourable one. The limitations section is candid — naming sampling bias, variation in law-enforcement practice, the absence of socioeconomic/community-support variables, and single-jurisdiction (Broward County) non-generalizability — and §2.1 engages ProPublica's documented miscalibration and the Chouldechova impossibility result directly.
Strongest critique
The central empirical claim is not statistically credible as presented: the low-risk racial disparity is a single race-by-risk-by-outcome cell crossing significance after ~7 months, drawn from log-rank tests across two races, three risk strata, two recidivism outcomes and the full follow-up window, with the medium/high cells null, no multiplicity correction, and no hazard ratios, confidence bands, or subgroup sample sizes; the seven-month threshold reads as data-selected. And while the paper's identification frame is more defensible than it first appears (the exclusion restriction is the openly-tested null, and miscalibration is discussed), the test still cannot separate residual algorithmic miscalibration from socioeconomic context, and the censoring of plausibly SES- and race-correlated non-criminal custody returns is assumed non-informative without defence — either of which would mechanically reproduce the observed late low-risk pattern.
Strongest fair defence
Read as a conceptual and methodological contribution rather than a definitive empirical finding, the paper is materially more defensible. Recasting recidivism from a binary outcome to a time-to-event process is a genuine and useful move, and the apparatus is explicitly framed as a falsification test, not a positive identification of mechanisms. The wording is consistently hedged ('possibly', 'may accumulate', 'suggests'); the limitations section candidly names sampling bias, law-enforcement variation, the absence of socioeconomic variables, and Broward-only non-generalizability; and §2.1 engages ProPublica's documented miscalibration and the Chouldechova impossibility result rather than ignoring them. The authors also report the null medium/high cells rather than only the favourable one — a point in favour of honest reporting.
Conclusion
The paper is a worthwhile conceptual contribution (treating recidivism as time-to-event and offering a survival-based falsification test for the role of non-algorithmic factors), but its empirical demonstration, as presented, is too underpowered, under-reported, and assumption-dependent to support the inferences attached to it. The single uncorrected low-risk log-rank result, reported without sample sizes, effect sizes, confidence bands, or multiplicity correction, cannot bear 'statistically significant disparities emerge', and the seven-month threshold reads as data-selected. On identification the paper is more careful than a first pass suggests (the exclusion restriction is the openly-tested null and miscalibration is discussed), so that flaw is moderate, not high: the remaining problem is that the test cannot separate residual algorithmic miscalibration from socioeconomic context, so the specific attribution to structural factors over-reaches (though hedged). The undefended non-informative-censoring assumption on plausibly SES- and race-correlated custody returns admits an informative-differential-censoring rival that could reproduce the headline. Net: high concerns concentrated in statistical inference and measurement, moderate in identification; the empirical claims should be read as a tentative illustration pending a corrected, fully reported, assumption-tested re-analysis. Procedural note: one verbatim span was re-anchored to a clean substring after an arXiv-HTML LaTeX percent artifact; the identification claim was softened from high to moderate after a defender lens confirmed the paper's explicit calibration discussion and openly-stated null; a reproducibility flaw was dropped as non-groundable from the retrieved text and double-counted against statistical inference.
Reply from the authors
Following the practice of Nature Matters Arising, Science Technical Comments and PNAS Letters, this Comment is published as one half of a Comment + Reply pair: the authors of the original article are invited to respond, and any reply is published here verbatim alongside the Comment as part of the record.
Reply: not yet invited. No reply has been received for publication.
The authors have a right of reply and no veto. A reply may request a factual correction, a methodological rebuttal, a clarification, a data/code update, or a severity challenge, and is published unedited. See the right-of-reply policy.
Automated re-evaluation after reply: Authors may reply at any time; this critique addresses claims, methods and inference only, never the authors.
References
Every external source this Comment cites, each with a verified link. 0 fabricated.
Source-grounding attestation
- ✓ Verbatim source spans present in the critique: 3/3 provenance spans re-derived in the critique prose
- ✓ Passes the publication validator: no errors
- ✓ Zero fabricated citations: 0 fabricated
- ✓ Severity within the access-basis cap: severity "high" ≤ cap "high" for open_access
Every verbatim span the critique relies on is re-derived in the prose in-app; span-in-source is re-verifiable offline (the abstract is re-fetched, not stored, per the no-reproduce policy).
Re-verify span-in-source offline: python3 scripts/verify-queue-critiques.py
Independent faithfulness review
A refute-by-default adversarial panel (two independent reviewers — an overreach lens and a mischaracterization lens — that fetched the real source) tried to prove this critique misread the paper. This is an AI adversarial review recorded with its reasoning, not a deterministic check.
Three lenses on the full text; survives-majority, no sustained defeat.
- (1) statistical_inference (HIGH, confirmed by all three): span "significant disparities begin to appear with longer periods approximately seven months of follow-up" is verbatim in the source (re-anchored from a draft span whose '(p<0.05)' was an arXiv-HTML LaTeX artifact). The full PDF has zero occurrences of hazard ratio / confidence interval / Bonferroni / FDR / multiple comparison and no per-subgroup N; the seven-month threshold is read off a time-varying p-value strip that crosses 0.05, i.e. data-identified, not pre-specified. GROUNDED.
- (2) identification (softened HIGH->MODERATE per defender): span "rejecting the null hypothesis that non-algorithmic factors (including socioeconomic ones) do not affect recidivism" is verbatim. The defender correctly showed the exclusion restriction IS the openly-stated null and that the paper engages ProPublica's documented miscalibration (Chouldechova), so the surviving, narrower charge is that the test cannot separate residual algorithmic miscalibration from socioeconomic context: bounded interpretive over-reach, hedged by the authors. GROUNDED at moderate.
- (3) measurement/informative censoring (HIGH, confirmed): span "return to custody for non-criminal violations (treated as censoring events in our analysis)" is verbatim; the paper states the independent-censoring assumption formally and never tests it, and SES-/race-correlated technical returns make it plausibly informative. GROUNDED.
Version & correction history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-06-28 | Initial publication (cross-domain generality proof — public_policy). |
No silent substantive corrections — every change is versioned and visible.
How to cite this Comment
Critical AI. Comment on “Fairness Is More Than Algorithms: Racial Disparities in Time-to-Recidivism” (Jessy Xinyi Han et al., arXiv (cs.CY; stat.AP) preprint, 2025). Critical AI; 2026. https://policywindow.org/critique/c/racial-disparities-time-to-recidivism
A registered DOI will replace the URL once minted; until then the canonical URL is the persistent identifier. Highwire/Dublin-Core citation tags and a schema.org Review record are embedded in this page for Google Scholar and reference managers.
Verify this Comment. Its checkable facts (target DOI, access-basis severity cap, zero fabricated citations) are served — as the app’s self-report — at /critique/api/critiques/racial-disparities-time-to-recidivism/verify; to confirm them independently of this site, re-derive the same checks (and resolve the target DOI) with npx tsx scripts/verify-critical-ai.ts --critique racial-disparities-time-to-recidivism --live.
Content fingerprint b6be1a3f5c0cc0a8 (v1.0) — this Comment’s substantive content is content-addressed; a silent post-publication edit would change it.