Model-Merging Risk

Policy Window Editorial Board

Last verified 2026-06-21

The governance concern that post-training combination of multiple specialised models — via weight averaging, task-arithmetic, or modular merging — can produce capability or safety properties not present in any single source model, in ways the original safety evaluations would miss.

Definition & scope

Field consensus on this concept: emerging

Model merging refers to a family of post-training techniques that combine the weights of multiple fine-tuned models into a single composite model without further training. Methods include simple weight averaging (Wortsman et al. 2022, 'Model Soups'), task arithmetic (Ilharco et al. 2023, 'Editing Models with Task Arithmetic'), TIES-Merging (Yadav et al. 2023, NeurIPS), DARE (Yu et al. 2024), and SLERP-style interpolation. The technique has spread rapidly among open-weight fine-tuners on Hugging Face: by late 2024 a substantial fraction of the top-ranked Open LLM Leaderboard models were merges rather than single-source fine-tunes.

The governance concern arises from a basic fact about weight-space composition: safety properties are not guaranteed to survive merging. A model that has been safety-trained on harmful-content refusals can be merged with a 'helpful-only' or 'uncensored' fine-tune to produce a model that recovers the underlying capability while losing the safety training (Bhardwaj et al. 2024, 'Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic'). Conversely, merges may exhibit capability properties that were not present in any source model. None of the major regulatory regimes (EU AI Act, US EO 14110, China GenAI Measures, NIST AI RMF) explicitly addresses model merging: the regulatory unit of analysis is 'a model' rather than 'a model + its merge descendants.' This is one of the most clearly identified under-governed surfaces in the open-weight ecosystem.
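For orientation, the merge families named above are commonly written as follows. This is a notational sketch, not a quotation from the cited papers: the symbols θ_0 (shared base weights), θ_i (the i-th fine-tuned model), τ_i (its task vector), λ_i (a scaling coefficient) and t (an interpolation parameter) are labels assumed here for illustration.

```latex
\begin{align*}
\theta_{\text{avg}} &= \frac{1}{n}\sum_{i=1}^{n}\theta_i
  && \text{(uniform weight averaging)}\\
\tau_i &= \theta_i - \theta_0, \qquad
\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{n}\lambda_i\,\tau_i
  && \text{(task arithmetic; } \lambda_i < 0 \text{ negates a task)}\\
\operatorname{slerp}(\theta_a,\theta_b;t) &=
  \frac{\sin\big((1-t)\Omega\big)}{\sin\Omega}\,\theta_a
  + \frac{\sin(t\Omega)}{\sin\Omega}\,\theta_b,
  \qquad \cos\Omega = \frac{\langle\theta_a,\theta_b\rangle}{\lVert\theta_a\rVert\,\lVert\theta_b\rVert}
  && \text{(spherical interpolation, } t\in[0,1]\text{)}
\end{align*}
```

TIES-Merging and DARE can be read as variations on the second line: both modify the task vectors τ_i (by trimming and sign election, or by random dropping and rescaling) before they are summed.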

Definition and Distinctions From Adjacent Risks

Model-merging risk denotes the governance concern that combining several fine-tuned models' weights into one composite — without further training — can yield safety or capability properties absent from every source model. It must be distinguished from ordinary fine-tuning risk: merging operates directly in weight space (averaging, task arithmetic, TIES-Merging, DARE, SLERP) rather than via gradient descent on new data, so it leaves no training-data trail to audit — sharpening the provenance and attribution deficits already documented at scale across AI training inputs ¹. It also differs from distillation, which transfers behaviour into a separate student network. The defining property is non-preservation: as the canonical demonstration ² shows, safety alignment can be subtracted or diluted through task arithmetic, recovering refused capabilities. This matters because per-model dangerous-capability evaluation ³ presumes a stable artefact, whereas the unit of harm is not 'a model' but a model plus its unbounded merge-descendant tree.

Mechanisms: Why Safety Is Not Preserved Under Merging

The technical substance rests on how fine-tuning behaves in weight space. Task arithmetic (Ilharco et al. 2023) treats fine-tuning as a 'task vector' (the weight delta between base and tuned models) that can be added or subtracted. Safety training is itself such a vector, so subtracting it, or averaging a safety-tuned model with an 'uncensored' fine-tune, partially cancels refusals while retaining the underlying capability ². Weight averaging ('Model Soups', Wortsman et al. 2022) and the task-vector refinements that sparsify deltas and reduce interference before combining them (TIES-Merging, Yadav et al. 2023; DARE, Yu et al. 2024) assume source models occupy a shared loss basin, but offer no guarantee that emergent behaviours stay within evaluated bounds ⁴. Because the merged model is a new artefact that none of the source evaluations examined, it inherits none of the source models' safety evaluations, and capabilities may surface that were latent in no single parent. This is precisely the class of broadly applicable behaviour that frontier-capability piloting probes per model ³, and that scholars argue current rules under-target by regulating the pre-trained model rather than concrete high-risk applications ⁵.
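The cancellation described above can be illustrated with a deliberately tiny toy. This is a minimal sketch under loud assumptions: weights are plain Python lists, the four-dimensional "layer", the idea that index 0 encodes a capability and index 1 a refusal behaviour, and all function names are hypothetical and chosen only for illustration. Real models do not separate behaviours this cleanly, and this is not the procedure of any cited paper.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(base: Weights, tuned: Weights) -> Weights:
    """Weight delta between a tuned model and the model it started from."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def apply_vectors(base: Weights, vectors: Sequence[Weights],
                  coeffs: Sequence[float]) -> Weights:
    """Task arithmetic: base + sum_i coeff_i * vector_i, parameter by parameter."""
    merged = {k: list(v) for k, v in base.items()}
    for vec, c in zip(vectors, coeffs):
        for k in merged:
            merged[k] = [m + c * d for m, d in zip(merged[k], vec[k])]
    return merged


def average(models: Sequence[Weights]) -> Weights:
    """Uniform weight averaging of models that share one architecture."""
    n = len(models)
    return {k: [sum(vals) / n for vals in zip(*(m[k] for m in models))]
            for k in models[0]}


# Hypothetical toy: index 0 = capability direction, index 1 = refusal direction.
base = {"w": [0.0, 0.0, 0.0, 0.0]}
capability = {"w": [1.0, 0.0, 0.0, 0.0]}
safety = {"w": [0.0, 1.0, 0.0, 0.0]}

aligned = apply_vectors(base, [capability, safety], [1.0, 1.0])
uncensored = apply_vectors(base, [capability], [1.0])

# Anyone holding both checkpoints can isolate the safety delta and negate it.
safety_delta = task_vector(uncensored, aligned)
stripped = apply_vectors(aligned, [safety_delta], [-1.0])

# Plain averaging with the uncensored model dilutes the refusal component.
soup = average([aligned, uncensored])

print(stripped["w"], soup["w"])
```

In the toy, negating the safety delta leaves the capability component untouched while the refusal component drops to zero, and averaging halves it. Adding the same delta back with a positive coefficient restores it, which is the intuition behind safety-vector re-addition.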

Governance Relevance and the Regulatory-Unit Problem

The core governance gap is one of analytic unit: every major regime (the EU AI Act, US EO 14110, China's GenAI Measures, and the NIST AI RMF) treats 'a model' as the discrete object of obligation, leaving merge descendants unaddressed. This compounds the definitional instability that scholars already document in the Act's text, which shifted across versions among 'AI system, general purpose AI system, foundation model, and generative AI' ⁶, and the risk-based architecture's difficulty with models whose autonomous behaviour 'challenges legal categories of authorship, accountability, and control' ⁷. Open-weight release is favoured by copyright safe harbours ⁸ while consent signals over training data are shifting rapidly ⁹, so the merge surface widens exactly where downstream accountability is weakest. Per-model dangerous-capability evaluation ³ is the nearest existing lever, but no established method applies it to a merge-descendant tree.
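The difference between treating 'a model' and 'a model plus its merge descendants' as the unit of analysis can be made concrete with a small lineage registry. This is a hedged sketch only: the `ModelRecord` fields, the registry layout, and the model names are hypothetical, and no existing regime or hosting platform is claimed to record lineage this way.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ModelRecord:
    name: str
    parents: List[str] = field(default_factory=list)  # empty for a single-source model
    evaluated: bool = False  # True only if this exact artefact was evaluated


def ancestors(registry: Dict[str, ModelRecord], name: str) -> Set[str]:
    """Every model reachable by following parent links upward from `name`."""
    seen: Set[str] = set()
    stack = list(registry[name].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(registry[current].parents)
    return seen


def unevaluated_merges(registry: Dict[str, ModelRecord]) -> List[str]:
    """Merges (records with parents) that were never evaluated themselves."""
    return sorted(r.name for r in registry.values()
                  if r.parents and not r.evaluated)


# Hypothetical lineage: one evaluated release, one unevaluated fine-tune, two merges.
registry = {
    r.name: r
    for r in [
        ModelRecord("base-chat", evaluated=True),
        ModelRecord("uncensored-ft"),
        ModelRecord("merge-a", parents=["base-chat", "uncensored-ft"]),
        ModelRecord("merge-b", parents=["merge-a", "base-chat"]),
    ]
}

print(sorted(ancestors(registry, "merge-b")))
print(unevaluated_merges(registry))
```

Under a per-artefact view only `base-chat` carries an evaluation; under a lineage view both merges are flagged, because an evaluated ancestor does not vouch for its descendants.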

Debates and Open Questions

Consensus here is emerging rather than settled. A first dispute concerns measurement: dangerous-capability evaluations on frontier models report 'early warning signs' but no strong present danger ³, and it is contested whether per-model thresholds can be meaningfully applied to a combinatorial descendant tree at all. A second concerns liability allocation across the supply chain: analyses of generative-AI liability under EU law identify gaps and propose targeted refinements ¹⁰, yet none assigns responsibility for emergent merge properties, and proposals for frontier oversight stress that self-regulation alone is insufficient and government intervention will be needed ¹¹. A third concerns whether open-weight ecosystems, in which a substantial share of leaderboard-topping models are merges, are governable at all without shifting the regulatory unit from artefact to lineage. Whether re-alignment via task arithmetic ² can be made robust, or remains permanently reversible, is unresolved.

Social-science evidence — the “so-what”

What the peer-reviewed social science shows: whether the harm this concept addresses is empirically real, and whether governance of it works. The badge is the epistemic status of the evidence (not the policy debate); “thin” or “absent” efficacy evidence is itself a finding (the “second silence”). Each epistemic-status label is Policy Window's editorial assessment of the cited evidence base (a structured classification), not a verdict any single source issues.

Is the harm real? Evidence: contested
Partly real, partly framing-dependent: that weight-space merging COMPOSES and ALTERS behavior is well-established (task arithmetic adds/negates capabilities via task vectors, Ilharco et al. 2023; weight averaging improves accuracy/robustness, Wortsman et al. 2022), and the clearest demonstrated risk is that merging PROPAGATES behaviors no one intended in the target — Hammoud et al. 2024 show a single misaligned expert propagates misalignment into the merged model (raising unsafe responses) even while domain/task expertise is retained. The stronger claim of a genuinely NOVEL beneficial capability absent from every parent is observed mainly in constructed/specific settings (sharp behavioral transitions in the Assembly-of-Experts 'Chimera' merges, Klagges, Dahlke et al. 2025 [TNG]; analogy-style task-vector arithmetic), so spontaneous frontier-scale capability emergence from merging is thin/contested rather than robustly demonstrated. Caveat: most rigorous evidence is about unsafe-behavior propagation, not new capabilities per se.
Sources: Ilharco et al. 2023 (Editing Models with Task Arithmetic, ICLR, arXiv:2212.04089); Wortsman et al. 2022 (Model Soups, ICML, arXiv:2203.05482); Hammoud et al. 2024 (Model Merging and Safety Alignment: One Bad Model Spoils the Bunch, Findings of EMNLP, arXiv:2406.14563); Klagges, Dahlke et al. 2025 (Assembly of Experts: Chimera LLM variants, TNG, arXiv:2506.14794)
Does governance work? Evidence: thin
Technical mitigations exist and show benchmark-level gains but no governance regime is evaluated: data-aware merging that injects synthetic safety data reduces propagated misalignment (Hammoud et al. 2024), and post-hoc fixes such as RESTA/safety-vector addition (Bhardwaj et al. 2024) and selective layer-wise SafeMERGE (Djuhera et al. 2025) restore some safety on harm benchmarks. These are partial, attack-specific repairs measured on safety datasets, not validated detectors of emergent capability change, and there is no impact evaluation that any disclosure, evaluation-before-release, or provenance requirement for merged models reduces downstream harm. Evidence that governance of merging works is thin.
Sources: Bhardwaj, Anh Tuan & Poria 2024 (Language Models are Homer Simpson! / RESTA, ACL, arXiv:2402.11746); Hammoud et al. 2024 (Model Merging and Safety Alignment, Findings of EMNLP, arXiv:2406.14563); Djuhera et al. 2025 (SafeMERGE: Preserving Safety Alignment via Selective Layer-Wise Merging, arXiv:2503.17239)

Editorial note

Model merging is under-governed because regulatory frameworks treat 'the model' as a discrete artefact, whereas open-weight merging produces an unbounded descendant tree. When citing in policy contexts, note the regulatory-unit-of-analysis problem explicitly.

References

Sources cited inline in the analysis, numbered in order of appearance.

[1] Longpre, Mahari, et al. (Data Provenance Initiative) (2024) A large-scale audit of dataset licensing and attribution in AI, Nature Machine Intelligence. doi:10.1038/s42256-024-00878-8. Audit of 1,800+ AI training datasets finds "licence omission rates of more than 70% and error rates of more than 50%" on popular hosting sites.
[2] Bhardwaj, Anh Tuan & Poria (2024) Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic, ACL. arXiv:2402.11746. Canonical demonstration that safety training is not preserved under task arithmetic / merging.
[3] Mary Phuong, Matthew Aitchison, Elliot Catt, et al. (Google DeepMind) (2024) Evaluating Frontier Models for Dangerous Capabilities, arXiv (cs.LG). arXiv:2403.13793. Pilots dangerous-capability evaluations (persuasion, cyber, self-proliferation) on frontier models, finding 'early warning signs' but no strong present danger, grounding evaluation-based gating.
[4] Wortsman et al. (2022) Model Soups, ICML. arXiv:2203.05482.
[5] Hacker, Engel & Mauer (2023) Regulating ChatGPT and other Large Generative AI Models, ACM FAccT '23. doi:10.1145/3593013.3594067. Argues AI regulation "has primarily focused on conventional AI models, not LGAIMs" and should target "concrete high-risk applications, and not the pre-trained model itself".
[6] David Fernández-Llorca, Emilia Gómez, Ignacio Sánchez, Gabriele Mazzini (2025) An interdisciplinary account of the terminological choices by EU policymakers ahead of the final agreement on the AI Act: AI system, general purpose AI system, foundation model, and generative AI, Artificial Intelligence and Law. doi:10.1007/s10506-024-09412-y. Traces how the AI Act's legal text shifted across versions among the terms 'AI system, general purpose AI system, foundation model, and generative AI', exposing definitional instability in the regime.
[7] Martina Hulok (2025) The EU model of AI governance: regulating artificial intelligence through law and policy, ERA Forum. doi:10.1007/s12027-025-00869-1. Analyses how the AI Act's risk-based model handles general-purpose and foundation models whose 'autonomous content generation challenges legal categories of authorship, accountability, and control'.
[8] Arne Radeisen (2026) Open Foundation Models and TDM Exceptions to Copyright – Building Blocks for an AI Ecosystem, GRUR International. doi:10.1093/grurint/ikag002. Argues Art. 3 CDSM Directive's scientific-research TDM exception 'does not grant rightsholders any control' and can be a 'safe harbor' for training openly released foundation models without licensing data.
[9] Shayne Longpre, Robert Mahari, Ariel Lee, et al. (2024) Consent in Crisis: The Rapid Decline of the AI Data Commons, arXiv (Data Provenance Initiative; presented NeurIPS Dataset. arXiv:2407.14933. Longitudinal audit of 14,000 web domains finds a 2023-24 surge in AI training restrictions, with '~5%+ of all tokens in C4...fully restricted from use' within a single year.
[10] Novelli, Casolari, Hacker, Spedicato & Floridi (2024) Generative AI in EU law: Liability, privacy, intellectual property, and cybersecurity, Computer Law & Security Review. doi:10.1016/j.clsr.2024.106066. Examines how the EU AI Act, liability regimes, GDPR, copyright and cybersecurity rules apply to generative AI, identifying gaps and proposing targeted regulatory refinements.
[11] Anderljung, Barnhart, Korinek, et al. (2023) Frontier AI Regulation: Managing Emerging Risks to Public Safety, arXiv. arXiv:2307.03718. Argues "industry self-regulation is an important first step" but "government intervention will be needed", proposing safety standards, registration and reporting, and compliance mechanisms.

Cite this article

@misc{policywindow-model-merging-risk,
  title  = {Model-Merging Risk},
  author = {Policy Window},
  year   = {n.d.},
  howpublished = {model-merging-risk — safety},
  url    = {https://policywindow.org/wiki/model-merging-risk},
  note   = {Primary source: https://arxiv.org/abs/2402.11746}
}

Persistent identifier: https://policywindow.org/wiki/model-merging-risk — committed-stable URL with content-versioning via ?asOf= (rollout pending per methodology §7). DOIs via Zenodo are on the roadmap.
