[R] KALAVAI: Predicting When Independent Specialist Fusion Works (gain = 0.82 × divergence − 2.72, R² = 0.856, tested 410M–6.9B)

Reddit r/MachineLearning / 3/25/2026


Key Points

  • The paper proposes KALAVAI, a fusion method where independently fine-tuned specialist models (trained with no communication or shared gradients) are combined using a lightweight MoE router trained on top in roughly 500 steps.
  • Experiments on Pythia show consistent accuracy gains over the best single specialist, roughly +7% to +8% at 410M and 1B and ~+6.5% at 6.9B, while demonstrating that the benefit can be predicted from how much specialists diverge from the base checkpoint (R² = 0.856).
  • Cross-lingual results are a key highlight: fusing specialists trained on languages largely outside Pythia's knowledge (e.g., Yoruba and Welsh) dramatically reduces perplexity, with the router simultaneously matching each specialist's performance on its own language.
  • A 20-contributor experiment (10 languages + 10 domains) yields +16.71% over the best specialist and suggests the router can discover domain overlap patterns (e.g., medical/chemistry routed ~60/40) without being explicitly informed.
  • The author notes practical limitations: inference cost scales with the number of specialists, results weren’t tested beyond 6.9B, the divergence-to-gain formula is based on only six points (heuristic), and LoRA isn’t sufficient because full fine-tuning of unfrozen layers is required.

Hey all,

I've been working on this for a few months and just put the paper on arXiv: https://arxiv.org/abs/2603.22755

Project page: https://murailabs.com/kalavai/

Code + scripts: https://github.com/mechramc/Kalavai

The basic idea: take a base checkpoint, give copies to a bunch of people, each person fine-tunes on their own domain or language independently (no communication, no shared gradients, nothing), then you collect all the checkpoints and train a lightweight MoE router on top in about 500 steps. The fused model beats every individual specialist.
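The fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: I'm assuming the router is a single linear layer over a pooled hidden state from the base model, and all names and shapes here are made up for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_next_token_logits(hidden, specialist_logits, router_w):
    """Combine frozen specialists with a lightweight learned router.

    hidden            : (d,) pooled representation of the context
    specialist_logits : (n, vocab) next-token logits, one row per specialist
    router_w          : (n, d) the only trainable parameters (trained ~500 steps)
    """
    gate = softmax(router_w @ hidden)   # (n,) mixture weights, sum to 1
    return gate @ specialist_logits     # (vocab,) fused logits
```

The specialists themselves stay frozen; only `router_w` is updated during the short router-training phase, which is what keeps the fusion cheap relative to the specialist fine-tuning runs.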

I tested this at 410M, 1B, and 6.9B on Pythia. The gains are consistent: around +7-8% over the best individual specialist at 410M and 1B, and +6.5% at 6.9B. The interesting part is that the gain is predictable from how much the specialists diverge from the base. I fit a simple linear formula (gain = 0.82 × divergence − 2.72, R² = 0.856) that lets you estimate whether a cooperative is worth doing before anyone trains anything.
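The fit can be wrapped in a back-of-the-envelope helper. The coefficients are from the title of the post; the break-even check is my own reading of "worth doing" (predicted gain above zero), and the divergence units are whatever the paper defines them to be:

```python
def predicted_gain(divergence):
    """Estimated accuracy gain (%) over the best single specialist,
    from the 6-point linear fit: gain = 0.82 * divergence - 2.72 (R^2 = 0.856).
    A heuristic, not a universal law (see limitations below)."""
    return 0.82 * divergence - 2.72

def worth_fusing(divergence):
    """Break-even: divergence must exceed 2.72 / 0.82 ~= 3.32
    before fusion is predicted to beat the best specialist."""
    return predicted_gain(divergence) > 0.0
```

The intercept being negative matches the intuition that near-identical specialists (low divergence) give the router nothing complementary to mix.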

The cross-lingual results are what I'm most excited about. I trained specialists on Tamil, Yoruba, Welsh, and Code — languages Pythia basically doesn't know — and fused them. Yoruba perplexity went from 41.9 to 7.7. Welsh from 102.7 to 22.1. The MoE matched each specialist's performance on its own language simultaneously. Nobody shared any data.

I also ran a 20-contributor experiment (10 languages + 10 domains) and got +16.71% over the best specialist. The router figured out on its own that medical and chemistry text should cross-route 60/40 — nobody told it those domains overlap.
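One way to surface routing patterns like that medical/chemistry overlap is to average the router's gate weights per domain over a held-out set. A post-hoc analysis sketch (my own illustration, not code from the repo; the example numbers in the test are fabricated for shape-checking only):

```python
import numpy as np

def mean_routing_by_domain(gates, domains):
    """Average router gate weights per evaluation domain.

    gates   : (n_examples, n_specialists) gate weights, one row per example
    domains : list of n_examples domain labels
    returns : dict mapping domain -> mean gate vector, i.e. how much
              probability mass each specialist receives on that domain
    """
    out = {}
    for d in set(domains):
        mask = np.array([label == d for label in domains])
        out[d] = gates[mask].mean(axis=0)
    return out
```

A cross-routing pattern like 60/40 would show up here as a domain whose mean gate vector puts substantial mass on another domain's specialist.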

Some honest limitations:

- Inference cost scales linearly with number of specialists (you run all of them)

- Haven't tested above 6.9B

- The predictive formula is based on 6 data points — useful as a heuristic, not a universal law

- LoRA doesn't work for this — you need full fine-tuning of unfrozen layers

**Where I could use help:**

I'm targeting NeurIPS 2026 with this and would love independent validation from folks with different hardware setups. The experiment is pretty self-contained:

  1. Pick a Pythia checkpoint (410M is cheapest, runs on consumer GPUs in under an hour)

  2. Fine-tune 3 specialists on different domains for 2,000 steps each

  3. Train the router for 500 steps on mixed data

  4. Compare fused model vs. best individual specialist on held-out eval
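For step 4, note that the headline number is relative improvement over the *strongest* specialist, not the average. A tiny helper to make the comparison unambiguous (my own convention; check the repo's eval scripts for the exact metric):

```python
def relative_gain_pct(fused_acc, specialist_accs):
    """Percent accuracy improvement of the fused model over the
    best individual specialist on the held-out eval."""
    best = max(specialist_accs)
    return 100.0 * (fused_acc - best) / best
```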

Everything you need is in the GitHub repo. If you can reproduce the ~+7% gain at 410M, that would be incredibly valuable; trying it at scales I haven't tested (13B+) would be even better. I'll credit any independent results that make it into the paper.

If you work with under-resourced languages or have domain-specific data you can't share publicly, this protocol was designed for exactly that situation.

The name is KALAVAI (கலவை) — Tamil for fusion/mixing. Built at Murai Labs.

Happy to answer any questions about the setup, the results, or the failure modes.

submitted by /u/No_Gap_4296