KALAVAI: Predicting When Independent Specialist Fusion Works -- A Quantitative Model for Post-Hoc Cooperative LLM Training

arXiv cs.CL / 3/25/2026


Key Points

  • KALAVAI proposes a post-hoc method to fuse independently fine-tuned domain specialist LLMs into one MoE-style model that outperforms each specialist, with gains empirically modeled as gain = 0.82×divergence − 2.72 (R²=0.856).
  • The paper reports that cooperative fusion value is predictable in advance, with gains approaching zero below ~3.3% divergence, allowing practitioners to estimate whether fusion is likely to help before spending compute.
  • In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently and then submit them for lightweight MoE router training (~500 steps), achieving consistent improvements (e.g., +7.72% at 410M and +7.49% at 1B versus the best specialist).
  • The learned router matches domain-oracle routing to within 10^-5 nats, and learned routing is essential: uniform averaging underperforms the best specialist, while any trained router reaches oracle-optimal assignment.
  • Cross-lingual and larger-federation experiments show substantial gains, including +21.76% for Tamil/Yoruba/Welsh/Code fusion and +16.71% from a 20-contributor federation, subject to constraints such as shared initialization (checkpoint mismatch degrades routing).
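The reported linear fit can be turned into a simple pre-fusion estimate. A minimal sketch (the function and constant names are illustrative, not from the paper); the ~3.3% break-even point follows directly from setting the fitted gain to zero:

```python
# Fitted model from the paper (both quantities in percent):
#   gain = 0.82 * divergence - 2.72   (R^2 = 0.856)
def predicted_fusion_gain(divergence_pct: float) -> float:
    """Estimate expected fusion gain (%) from specialist divergence (%)."""
    return 0.82 * divergence_pct - 2.72

# Break-even divergence: 0 = 0.82 * d - 2.72  =>  d = 2.72 / 0.82 ≈ 3.32%
BREAK_EVEN_DIVERGENCE_PCT = 2.72 / 0.82

def fusion_likely_helps(divergence_pct: float) -> bool:
    """Rough go/no-go check before committing compute to fusion."""
    return divergence_pct > BREAK_EVEN_DIVERGENCE_PCT
```

For example, specialists diverging by 10% would be predicted to gain about 0.82 × 10 − 2.72 = 5.48% from fusion, while 2% divergence falls below break-even.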

Abstract

Independently trained domain specialists can be fused post-hoc into a single model that outperforms any individual specialist, and the gain is predictable: gain = 0.82 × divergence − 2.72 (R² = 0.856, n = 6, over 3–26% divergence). This enables practitioners to estimate cooperative value before committing compute. Below ~3.3% divergence, gains approach zero.

In the KALAVAI protocol, contributors fine-tune copies of a shared checkpoint independently, then submit them for lightweight MoE routing (500 steps). Gains are consistent: +7.72% at 410M (±0.02%, 3 seeds), +7.49% at 1B (±0.01%, 3 seeds), and +6.53% at 6.9B, each over the best specialist. The router matches domain-oracle routing to within 10^{-5} nats. Cross-lingual fusion (Tamil/Yoruba/Welsh/Code) achieves +21.76%, with Yoruba perplexity falling from 41.9 to 7.7. A 20-contributor federation achieves +16.71% (±0.07 pp, 3 seeds).

Three requirements bound the protocol. Shared initialisation is necessary: checkpoint mismatch degrades routing. Frozen layers are optional below ~10,000 steps and beneficial beyond. Learned routing is essential: uniform averaging degrades performance by −1.2% vs. the best specialist, while any trained router achieves oracle-optimal assignment.
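The contrast between learned routing and uniform averaging can be illustrated with a toy gate. This is a minimal sketch under strong assumptions: the "experts" here are plain linear maps standing in for the fine-tuned LLM specialists, and the gate is a softmax over per-expert scores (the paper's actual router architecture is not specified here):

```python
import math
import random

random.seed(0)

# Toy stand-ins for specialists: each expert is a weight vector mapping
# an input vector to a scalar output (hypothetical, for illustration only).
N_EXPERTS, D_IN = 3, 4
experts = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_EXPERTS)]

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def routed_output(x, gate):
    """Learned routing: a trainable gate scores each expert per input,
    and expert outputs are mixed with the softmax of those scores."""
    scores = [sum(g * xi for g, xi in zip(grow, x)) for grow in gate]
    weights = softmax(scores)
    outs = [sum(w * xi for w, xi in zip(e, x)) for e in experts]
    return sum(w * o for w, o in zip(weights, outs))

def uniform_output(x):
    """Baseline the paper reports as underperforming: equal-weight
    averaging of expert outputs, with no input-dependent gate."""
    outs = [sum(w * xi for w, xi in zip(e, x)) for e in experts]
    return sum(outs) / len(outs)
```

With an all-zero (untrained) gate the softmax weights are uniform and the two functions coincide; the paper's claim is that once the gate is trained, input-dependent weights recover oracle-like domain assignment that uniform averaging cannot.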