KALAVAI: Predicting When Independent Specialist Fusion Works -- A Quantitative Model for Post-Hoc Cooperative LLM Training

arXiv cs.CL / 3/25/2026


Key Points

  • KALAVAI proposes a post-hoc method to fuse independently fine-tuned domain specialist LLMs into one MoE-style model that outperforms each specialist, with gains empirically modeled as gain = 0.82×divergence − 2.72 (R²=0.856).
  • The paper reports that cooperative fusion value is predictable in advance, with gains approaching zero below ~3.3% divergence, allowing practitioners to estimate whether fusion is likely to help before spending compute.
  • In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently and then submit them for lightweight MoE router training (~500 steps), achieving consistent improvements (e.g., +7.72% at 410M and +7.49% at 1B versus the best specialist).
  • The learned router matches domain-oracle routing to within 10^-5 nats, and learned routing is essential: uniform averaging underperforms the best specialist, while any trained router reaches oracle-optimal assignment.
  • Cross-lingual and larger-federation experiments show substantial gains, including +21.76% for Tamil/Yoruba/Welsh/Code fusion and +16.71% from a 20-contributor federation, subject to constraints such as shared initialization (checkpoint mismatch degrades routing).
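The reported linear fit can be turned into a simple pre-fusion estimate. A minimal sketch (the function and constant names are illustrative, not from the paper); the ~3.3% break-even point follows directly from setting the fitted gain to zero:

```python
# Fitted model from the paper (both quantities in percent):
#   gain = 0.82 * divergence - 2.72   (R^2 = 0.856)
def predicted_fusion_gain(divergence_pct: float) -> float:
    """Estimate expected fusion gain (%) from specialist divergence (%)."""
    return 0.82 * divergence_pct - 2.72

# Break-even divergence: 0 = 0.82 * d - 2.72  =>  d = 2.72 / 0.82 ≈ 3.32%
BREAK_EVEN_DIVERGENCE_PCT = 2.72 / 0.82

def fusion_likely_helps(divergence_pct: float) -> bool:
    """Rough go/no-go check before committing compute to fusion."""
    return divergence_pct > BREAK_EVEN_DIVERGENCE_PCT
```

For example, specialists diverging by 10% would be predicted to gain about 0.82 × 10 − 2.72 = 5.48% from fusion, while 2% divergence falls below break-even.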

Abstract

Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 × divergence − 2.72 (R² = 0.856, n = 6, over 3–26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero.

In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit them for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (±0.02%, 3 seeds), +7.49% at 1B (±0.01%, 3 seeds), and +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing to within 10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling from 41.9 to 7.7. A 20-contributor federation achieves +16.71% (±0.07 pp, 3 seeds).

Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades performance by −1.2% vs. the best specialist, while any trained router achieves oracle-optimal assignment.
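The contrast between learned routing and uniform averaging can be illustrated with a toy gate. This is a minimal sketch under strong assumptions: the "experts" here are plain linear maps standing in for the fine-tuned LLM specialists, and the gate is a softmax over per-expert scores (the paper's actual router architecture is not specified here):

```python
import math
import random

random.seed(0)

# Toy stand-ins for specialists: each expert is a weight vector mapping
# an input vector to a scalar output (hypothetical, for illustration only).
N_EXPERTS, D_IN = 3, 4
experts = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_EXPERTS)]

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def routed_output(x, gate):
    """Learned routing: a trainable gate scores each expert per input,
    and expert outputs are mixed with the softmax of those scores."""
    scores = [sum(g * xi for g, xi in zip(grow, x)) for grow in gate]
    weights = softmax(scores)
    outs = [sum(w * xi for w, xi in zip(e, x)) for e in experts]
    return sum(w * o for w, o in zip(weights, outs))

def uniform_output(x):
    """Baseline the paper reports as underperforming: equal-weight
    averaging of expert outputs, with no input-dependent gate."""
    outs = [sum(w * xi for w, xi in zip(e, x)) for e in experts]
    return sum(outs) / len(outs)
```

With an all-zero (untrained) gate the softmax weights are uniform and the two functions coincide; the paper's claim is that once the gate is trained, input-dependent weights recover oracle-like domain assignment that uniform averaging cannot.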