The governance concern that post-training combination of multiple specialised models (via weight averaging, task arithmetic, or modular merging) can produce capability or safety properties not present in any single source model, in ways that the original safety evaluations would miss.
Definition and scope
Model merging refers to a family of post-training techniques that combine the weights of multiple fine-tuned models into a single composite model without further training. Methods include simple weight averaging (Wortsman et al. 2022, 'Model Soups'), task arithmetic (Ilharco et al. 2023, 'Editing Models with Task Arithmetic'), TIES-Merging (Yadav et al. 2023, NeurIPS), DARE (Yu et al. 2024), and SLERP-style interpolation. The technique has spread rapidly among open-weight fine-tuners on Hugging Face: by late 2024 a substantial fraction of the top-ranked Open LLM Leaderboard models were merges rather than single-source fine-tunes.

The governance concern arises from a basic property of these methods: safety behaviour is not guaranteed to survive a merge. A model that has been safety-trained to refuse harmful requests can be merged with a 'helpful-only' or 'uncensored' fine-tune to produce a model that recovers the underlying capability while losing the refusal behaviour (see Bhardwaj et al. 2024, 'Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic'). Conversely, a merge can exhibit capabilities that were not present in any single source model.

None of the major regulatory regimes or governance frameworks (EU AI Act, US EO 14110, China GenAI Measures, NIST AI RMF) explicitly addresses model merging: the regulatory unit of analysis is 'a model' rather than 'a model plus its merge descendants'. This is one of the most clearly identified under-governed surfaces in the open-weight ecosystem.
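The two simplest merge operations named above, uniform weight averaging and task arithmetic, can be illustrated with a minimal sketch. It makes several assumptions: each "model" is a toy dictionary of parameter lists rather than a real checkpoint, all models share the same parameter names and shapes, and the function and variable names (average_weights, task_vector, apply_task_vectors, safety_tuned, uncensored) are hypothetical and not taken from any library or from the cited papers.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniform weight averaging over models with identical parameter names/shapes."""
    if not models:
        raise ValueError("need at least one model")
    return {
        name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: element-wise difference between a fine-tune and its base."""
    return {n: [f - b for f, b in zip(finetuned[n], base[n])] for n in base}


def apply_task_vectors(base: Weights, vectors: List[Weights], scale: float = 1.0) -> Weights:
    """Add scaled task vectors to a starting model (a negative scale subtracts them)."""
    merged = {n: list(w) for n, w in base.items()}
    for tv in vectors:
        for n in merged:
            merged[n] = [w + scale * d for w, d in zip(merged[n], tv[n])]
    return merged


# Hypothetical toy checkpoints (values chosen only for illustration).
base = {"layer.w": [1.0, 2.0]}
safety_tuned = {"layer.w": [1.5, 2.0]}   # base plus a "safety" delta
uncensored = {"layer.w": [1.0, 1.0]}     # base plus an "uncensored" delta

tv_safe = task_vector(safety_tuned, base)
tv_unc = task_vector(uncensored, base)

# Task-arithmetic merge: both deltas are added to the shared base.
merged = apply_task_vectors(base, [tv_safe, tv_unc])

# Negating the safety task vector removes the safety delta from the weights.
undone = apply_task_vectors(safety_tuned, [tv_safe], scale=-1.0)

# Uniform average ("soup") of the two fine-tunes.
soup = average_weights([safety_tuned, uncensored])
```

In this toy setting, subtracting the safety task vector returns exactly the base weights. Real networks are not linear in their weights, so the sketch shows only the arithmetic; how behaviour changes after a merge has to be measured by evaluating the merged model itself.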
Related concepts
- AI Supply Chain: The end-to-end pipeline of inputs, intermediate artefacts, and downstream applications by which an A
- Model Distillation Risk: The risk that a closed-weight frontier model's capabilities can be partially recovered by training a
- Capability Elicitation: Techniques designed to reveal the upper bounds of an AI model's capabilities, rather than measuring
- Jailbreak Resistance: The robustness of an AI model's safety training against adversarial prompts crafted to elicit policy
- AI Alignment: The technical problem of designing AI systems whose objectives, behaviour, and emergent goals reliab
Editorial note
Model merging is under-governed because regulatory frameworks treat 'the model' as a discrete artefact, whereas open-weight merging produces an unbounded descendant tree. When citing in policy contexts, note the regulatory-unit-of-analysis problem explicitly.
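The unit-of-analysis problem can be made concrete with a small lineage sketch. It is illustrative only: the ModelRecord type, its evaluated flag, and the model names are hypothetical and do not correspond to any real registry, regulation, or tool. The point it shows is that a merge descendant can trace back to an evaluated source model without that evaluation ever having covered the descendant itself.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical registry entry for one model artefact."""
    name: str
    evaluated: bool = False                    # was this exact artefact safety-evaluated?
    parents: Tuple["ModelRecord", ...] = ()    # empty for a single-source model


def source_models(model: ModelRecord) -> Set[str]:
    """Names of the single-source models at the roots of a merge lineage."""
    if not model.parents:
        return {model.name}
    names: Set[str] = set()
    for parent in model.parents:
        names |= source_models(parent)
    return names


def needs_evaluation(model: ModelRecord) -> bool:
    """Evaluation status is per artefact; it is not inherited from ancestors."""
    return not model.evaluated


# Hypothetical lineage: an evaluated chat model is merged twice.
base_chat = ModelRecord("base-chat", evaluated=True)
uncensored_ft = ModelRecord("uncensored-ft")
code_ft = ModelRecord("code-ft")
merge_a = ModelRecord("merge-a", parents=(base_chat, uncensored_ft))
merge_b = ModelRecord("merge-b", parents=(merge_a, code_ft))

roots = source_models(merge_b)
unevaluated = needs_evaluation(merge_b)
```

Here merge_b descends from the evaluated base-chat model, yet no evaluation was ever run on merge_b itself, which is the gap that arises when a framework attaches obligations to a single discrete model.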