{"$schema":"https://policywindow.org/critique/api/schema","critique_id":"CRIT-000026","slug":"racial-disparities-time-to-recidivism","url":"https://policywindow.org/critique/c/racial-disparities-time-to-recidivism","doi":null,"status":"published","critique_type":"editorially_approved_ai_native_critique","publication_date":"2026-06-28","current_version":"1.0","target_paper":{"title":"Fairness Is More Than Algorithms: Racial Disparities in Time-to-Recidivism","authors":["Jessy Xinyi Han","Kristjan Greenewald","Devavrat Shah"],"journal":"arXiv (cs.CY; stat.AP) preprint","doi":"10.48550/arXiv.2504.18629","url":"https://arxiv.org/abs/2504.18629","publicationDate":"2025","paperType":"empirical","accessBasis":"open_access","fullTextUsed":true,"fictional":false,"doi_url":"https://doi.org/10.48550/arXiv.2504.18629"},"source_journal":{"tier":"exception","rankingSources":["https://doi.org/10.48550/arXiv.2504.18629","https://arxiv.org/abs/2504.18629"],"rankingNote":"arXiv preprint (cs.CY; stat.AP), not peer-reviewed; posted under CC BY 4.0. Included as an influential algorithmic-fairness / criminal-justice contribution (a COMPAS audit reframing recidivism as time-to-event); tier 'exception' (preprint). 
Critiqued at full text via the source store."},"selection_provenance":{"id":"racial-disparities-time-to-recidivism","venue":"arXiv (cs.CY; stat.AP) preprint","inMonitoredSet":false,"determinedTier":null,"recordedTier":"exception","effectiveTier":"exception","kind":"off_list","disclosed":true,"offListPeerReviewed":false},"selection":{"aiAgiCentralityScore":3,"societalRelevanceScore":5,"aiAgiCategories":["inequality_bias_fairness","surveillance_security_policing"],"selectionReason":"Cross-domain generality proof (public_policy white-space): a full-text critique of an AI-in-criminal-justice algorithmic-fairness audit, span-grounded to the gold-OA arXiv full text via the source store."},"scores":{"aiAgiContribution":3,"evidentiarySupport":2,"methodologicalRisk":4,"overclaiming":3,"reproducibilityOrAuditability":3,"societalImpactRelevance":5,"severity":"high","confidence":"high"},"severity_cap_for_access_basis":"high","plain_language_summary":"This arXiv preprint reframes recidivism from a yes/no outcome to a time-to-event (survival) process and offers a survival-based falsification test: if, within a COMPAS risk band, the timing of re-arrest differs by race, that supports rejecting the null that non-algorithmic (socioeconomic) factors do not affect recidivism. Applied to the ProPublica COMPAS data (~10,000 Broward County defendants), the headline is that medium- and high-risk strata show no racial difference, but among low-risk defendants a disparity emerges after roughly seven months, read as evidence that structural factors accumulate over time. The conceptual move is genuine, and the paper is candid in its limitations (jurisdiction, missing socioeconomic variables) and consistently hedged ('possibly', 'may accumulate'). Three cautions remain. 
First, the headline rests on a single low-risk cell crossing significance, drawn from a battery of log-rank tests across two races, three risk bands, two risk-score outcomes and the whole follow-up window, with the other cells null and with no multiple-comparison correction, hazard ratios, confidence bands or subgroup sample sizes reported, so it reads as weakly supported and the seven-month threshold looks data-selected. Second, on identification, the paper does engage COMPAS's documented race-differential miscalibration and states its exclusion restriction openly, so the issue is narrower than 'a hidden assumption': rejecting the null shows some non-algorithmic factor matters but cannot separate residual algorithmic miscalibration from socioeconomic context, so the specific attribution to housing/employment/support outruns the test. Third, treating non-criminal returns to custody as non-informative censoring is asserted, not defended; those returns are plausibly correlated with the same socioeconomic instability and with race, and informative differential censoring over a long window can itself manufacture the late low-risk disparity. 
(One span was re-anchored to a clean substring after an arXiv-HTML LaTeX artifact; the identification flaw was softened from high to moderate after a defender confirmed the paper's explicit calibration discussion; a reproducibility item was dropped as non-groundable and double-counted.)","claims":[{"id":"C1","text":"The headline that racial disparities significantly emerge among low-risk defendants rests on a single uncorrected subgroup log-rank result, reported without effect sizes or precision.","type":"methodological","evidenceOffered":"significant disparities begin to appear with longer periods approximately seven months of follow-up","support":"weak","overclaiming":"major","assessment":"The significant low-risk cell is one of many: log-rank tests are run across two races, three risk strata (low 1-4, medium 5-7, high 8-10), two risk-score outcomes (risk of recidivism and of violent recidivism), and the full follow-up window, with the medium- and high-risk cells explicitly null (p>0.1). No multiple-comparison correction, hazard ratios, confidence intervals or bands, or per-subgroup sample sizes are reported, and the 'approximately seven months' threshold reads as identified from the data (read off where a time-varying p-value curve first crosses 0.05) rather than pre-specified. 
Under this much uncorrected multiplicity, one significant low-risk cell is about what chance predicts, so 'disparities significantly emerge' is weakly supported as presented.","mainWeakness":"A single uncorrected subgroup cell, without effect sizes, precision, or multiplicity control, cannot bear the headline; the seven-month threshold appears data-selected.","confidence":"high"},{"id":"C2","text":"Rejecting the no-non-algorithmic-effect null shows some non-algorithmic factor matters, but the test cannot separate residual algorithmic miscalibration from socioeconomic context.","type":"causal","evidenceOffered":"rejecting the null hypothesis that non-algorithmic factors (including socioeconomic ones) do not affect recidivism","support":"moderate","overclaiming":"moderate","assessment":"Credit where due: the paper openly states this null and engages COMPAS's documented race-differential miscalibration (invoking the Chouldechova impossibility result), so the exclusion restriction is deliberately tested, not smuggled in. The remaining, narrower problem is identifying the mechanism: residual within-bin algorithmic miscalibration and unobserved socioeconomic factors sit in the same unobserved node, so rejecting the null cannot distinguish them. 
The escalation from the disjunctive null to specifically 'housing, employment, and social support' therefore outruns what the test identifies — although the paper hedges this with 'possibly' and 'may accumulate', which lowers the charge from a failed claim to bounded interpretive over-reach.","mainWeakness":"The test licenses 'some non-algorithmic factor matters', not the specific attribution to socioeconomic structures; residual algorithmic miscalibration is an unseparated rival.","confidence":"high"},{"id":"C3","text":"Treating non-criminal returns to custody as censoring events requires non-informative censoring, which is stated as a formal precondition but never tested or defended.","type":"methodological","evidenceOffered":"return to custody for non-criminal violations (treated as censoring events in our analysis)","support":"weak","overclaiming":"moderate","assessment":"Survival/log-rank inference requires censoring to be independent of the event process; the paper states this as a formal assumption but never tests or probes its sensitivity. Technical/parole returns to custody are plausibly correlated with the same socioeconomic instability (housing, employment) the paper invokes — and with race — making the censoring informative and differential. Over a longer follow-up window, informative differential censoring can itself manufacture the late-emerging low-risk disparity, leaving it as an unrefuted rival for precisely the signal attributed to structural factors. 
Relatedly, using rearrest as the recidivism event carries race-correlated measurement error from differential policing.","mainWeakness":"An undefended non-informative-censoring assumption on plausibly SES- and race-correlated custody returns; if violated, it reproduces the headline pattern.","confidence":"high"}],"sections":[{"id":"what","title":"What the paper does","body":"Reframing recidivism from a binary outcome to a time-to-event (survival) process, the paper proposes a causal/survival framework and a falsification test: within a COMPAS risk band, a difference in re-arrest timing by race supports rejecting the null that non-algorithmic (socioeconomic) factors do not affect recidivism. Applied to the ProPublica COMPAS data (~10,000 Broward County defendants), the medium/high-risk strata show no racial difference while a low-risk disparity emerges after roughly seven months, read as accumulating structural disadvantage."},{"id":"flaw1","title":"Statistical Inference","body":"The significant low-risk cell is one of many: log-rank tests are run across two races, three risk strata, two risk-score outcomes, and the full follow-up window, with the medium/high cells explicitly null (p>0.1). No multiple-comparison correction, hazard ratios, confidence intervals/bands, or per-subgroup sample sizes are reported, and the 'approximately seven months' threshold reads as data-identified rather than pre-specified. Under this much uncorrected multiplicity, one significant low-risk cell is about what chance predicts, so the headline that disparities significantly emerge is weakly supported as presented."},{"id":"flaw2","title":"Identification (mechanism, not a hidden assumption)","body":"On the merits the paper does better here than a first pass suggests: it states its exclusion restriction openly (it IS the tested null) and explicitly engages COMPAS's documented race-differential miscalibration. 
The narrower, surviving issue is that rejecting the null shows some non-algorithmic factor matters but cannot separate residual within-bin algorithmic miscalibration from socioeconomic context — both occupy the same unobserved node. The jump from the disjunctive null to specifically 'housing, employment, and social support' therefore outruns what the test identifies, though the paper's hedging ('possibly', 'may accumulate') keeps this to bounded interpretive over-reach."},{"id":"flaw3","title":"Measurement / Informative Censoring","body":"Log-rank inference requires non-informative censoring; the paper asserts this as a formal precondition for non-criminal returns to custody but never tests it. Such technical/parole returns are plausibly SES- and race-correlated, so the censoring is plausibly informative and differential — and over a long follow-up window that alone can manufacture the late low-risk disparity, an unrefuted rival for the very signal attributed to structural factors. Rearrest as the outcome adds race-correlated measurement error from differential policing."},{"id":"strengths","title":"What the paper does well","body":"The conceptual contribution is genuine: recasting recidivism as time-to-event can surface divergences that single-time-point binary comparisons miss, and most algorithmic-fairness work ignores temporal dynamics. The framing is explicitly a falsification test rather than a positive identification of mechanisms, the wording is consistently hedged ('possibly', 'may accumulate', 'suggests'), and the paper reports the null medium/high cells rather than only the favourable one. 
The limitations section is candid — naming sampling bias, variation in law-enforcement practice, the absence of socioeconomic/community-support variables, and single-jurisdiction (Broward County) non-generalizability — and §2.1 engages ProPublica's documented miscalibration and the Chouldechova impossibility result directly."}],"strongest_critique":"The central empirical claim is not statistically credible as presented: the low-risk racial disparity is a single race-by-risk-by-outcome cell crossing significance after ~7 months, drawn from log-rank tests across two races, three risk strata, two recidivism outcomes and the full follow-up window, with the medium/high cells null, no multiplicity correction, and no hazard ratios, confidence bands, or subgroup sample sizes; the seven-month threshold reads as data-selected. And while the paper's identification frame is more defensible than it first appears (the exclusion restriction is the openly-tested null, and miscalibration is discussed), the test still cannot separate residual algorithmic miscalibration from socioeconomic context, and the censoring of plausibly SES- and race-correlated non-criminal custody returns is assumed non-informative without defence — either of which would mechanically reproduce the observed late low-risk pattern.","strongest_fair_defence":"Read as a conceptual and methodological contribution rather than a definitive empirical finding, the paper is materially more defensible. Recasting recidivism from a binary outcome to a time-to-event process is a genuine and useful move, and the apparatus is explicitly framed as a falsification test, not a positive identification of mechanisms. 
The wording is consistently hedged ('possibly', 'may accumulate', 'suggests'); the limitations section candidly names sampling bias, law-enforcement variation, the absence of socioeconomic variables, and Broward-only non-generalizability; and §2.1 engages ProPublica's documented miscalibration and the Chouldechova impossibility result rather than ignoring them. The authors also report the null medium/high cells rather than only the favourable one — a point in favour of honest reporting.","final_judgment":"A worthwhile conceptual contribution — treating recidivism as time-to-event and offering a survival-based falsification test for the role of non-algorithmic factors — whose empirical demonstration, as presented, is too underpowered, under-reported, and assumption-dependent to support the inferences attached to it. The single uncorrected low-risk log-rank result, reported without sample sizes, effect sizes, confidence bands, or multiplicity correction, cannot bear 'statistically significant disparities emerge', and the seven-month threshold reads as data-selected. On identification the paper is more careful than a first pass suggests — the exclusion restriction is the openly-tested null and miscalibration is discussed — so that flaw is moderate, not high: the residual problem is that the test cannot separate residual algorithmic miscalibration from socioeconomic context, so the specific attribution to structural factors over-reaches (though hedged). The undefended non-informative-censoring assumption on plausibly SES- and race-correlated custody returns admits an informative-differential-censoring rival that would reproduce the headline. Net: high concerns concentrated in statistical inference and measurement, moderate in identification; the empirical claims should be read as a tentative illustration pending a corrected, fully reported, assumption-tested re-analysis. 
Procedural note: one verbatim span was re-anchored to a clean substring after an arXiv-HTML LaTeX percent artifact; the identification claim was softened from high to moderate after a defender lens confirmed the paper's explicit calibration discussion and openly stated null; a reproducibility flaw was dropped as non-groundable from the retrieved text and as double-counted against statistical inference.","review_process":{"aiAgentsUsed":["claim_extraction","methods","statistics","adversarial","author_defence","plain_language","meta_review"],"reviewRounds":2,"humanEditor":{"name":"","role":"","approvalDate":"2026-06-28","declaredConflict":"none"},"expertCertification":{"used":false}},"author_response":{"notified":false,"status":"not_yet_invited","editorialActionAfterResponse":"Authors may reply at any time; this critique addresses claims, methods and inference only, never the authors."},"versions":[{"version":"1.0","date":"2026-06-28","note":"Initial publication (cross-domain generality proof, public_policy).","changeType":"initial"}],"transparency":{"modelCardUrl":"/critique/model-card","publicAuditSummary":"Full-text critique of a gold-OA (CC BY 4.0) arXiv preprint; every span was verified as an exact substring of the full text (source store) and independently re-checked, with one span re-anchored after an arXiv-HTML LaTeX artifact; DOI resolves (title+author+year matched via DataCite). Convergence gate (refute+defender+neutral) survives-majority, no sustained defeat; the identification claim was softened to moderate per the defender. 
Targets claims/methods/inference only.","privateAuditRecordExists":true,"citationVerification":{"status":"complete","checkedSources":[{"label":"DOI 10.48550/arXiv.2504.18629 (DataCite: title+author+year matched)","url":"https://doi.org/10.48550/arXiv.2504.18629","verified":true},{"label":"Full text used for span verification (arXiv HTML)","url":"https://arxiv.org/html/2504.18629v1","verified":true}],"fabricatedCitations":0},"riskReview":{"copyright":"completed","defamation":"completed","note":"Gold-OA preprint quoted sparingly under criticism/review; targets claims/methods/inference only."}}}