[R] KALAVAI: Predicting When Independent Specialist Fusion Works (gain = 0.82 × divergence − 2.72, R² = 0.856, tested 410M–6.9B)

Reddit r/MachineLearning / 3/25/2026


Key Points

  • The paper proposes KALAVAI, a fusion method where independently fine-tuned specialist models (trained with no communication or shared gradients) are combined using a lightweight MoE router trained on top in roughly 500 steps.
  • Experiments on Pythia show consistent accuracy gains over the best single specialist, roughly +7% to +8% at 410M and 1B and ~+6.5% at 6.9B, while demonstrating that the benefit can be predicted from how much specialists diverge from the base checkpoint (R² = 0.856).
  • Cross-lingual results are a key highlight: fusing specialists trained on languages largely outside Pythia's knowledge (e.g., Yoruba and Welsh) dramatically reduces perplexity, with the router simultaneously matching each specialist's performance on its own language.
  • A 20-contributor experiment (10 languages + 10 domains) yields +16.71% over the best specialist and suggests the router can discover domain overlap patterns (e.g., medical/chemistry routed ~60/40) without being explicitly informed.
  • The author notes practical limitations: inference cost scales with the number of specialists, results weren’t tested beyond 6.9B, the divergence-to-gain formula is based on only six points (heuristic), and LoRA isn’t sufficient because full fine-tuning of unfrozen layers is required.

Hey all,

I've been working on this for a few months and just put the paper on arXiv: https://arxiv.org/abs/2603.22755

Project page: https://murailabs.com/kalavai/

Code + scripts: https://github.com/mechramc/Kalavai

The basic idea: take a base checkpoint, give copies to a bunch of people, each person fine-tunes on their own domain or language independently (no communication, no shared gradients, nothing), then you collect all the checkpoints and train a lightweight MoE router on top in about 500 steps. The fused model beats every individual specialist.
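The fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: I'm assuming the router is a single linear layer over a pooled hidden state from the base model, and all names and shapes here are made up for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_next_token_logits(hidden, specialist_logits, router_w):
    """Combine frozen specialists with a lightweight learned router.

    hidden            : (d,) pooled representation of the context
    specialist_logits : (n, vocab) next-token logits, one row per specialist
    router_w          : (n, d) the only trainable parameters (trained ~500 steps)
    """
    gate = softmax(router_w @ hidden)   # (n,) mixture weights, sum to 1
    return gate @ specialist_logits     # (vocab,) fused logits
```

The specialists themselves stay frozen; only `router_w` is updated during the short router-training phase, which is what keeps the fusion cheap relative to the specialist fine-tuning runs.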

I tested this at 410M, 1B, and 6.9B on Pythia. The gains are consistent: around +7-8% over the best individual specialist at 410M and 1B, and +6.5% at 6.9B. The interesting part is that the gain is predictable from how much the specialists diverge from the base. I fit a simple linear formula (gain = 0.82 × divergence − 2.72, R² = 0.856) that lets you estimate whether a cooperative is worth doing before anyone trains anything.
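The fit can be wrapped in a back-of-the-envelope helper. The coefficients are from the title of the post; the break-even check is my own reading of "worth doing" (predicted gain above zero), and the divergence units are whatever the paper defines them to be:

```python
def predicted_gain(divergence):
    """Estimated accuracy gain (%) over the best single specialist,
    from the 6-point linear fit: gain = 0.82 * divergence - 2.72 (R^2 = 0.856).
    A heuristic, not a universal law (see limitations below)."""
    return 0.82 * divergence - 2.72

def worth_fusing(divergence):
    """Break-even: divergence must exceed 2.72 / 0.82 ~= 3.32
    before fusion is predicted to beat the best specialist."""
    return predicted_gain(divergence) > 0.0
```

The intercept being negative matches the intuition that near-identical specialists (low divergence) give the router nothing complementary to mix.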

The cross-lingual results are what I'm most excited about. I trained specialists on Tamil, Yoruba, Welsh, and Code — languages Pythia basically doesn't know — and fused them. Yoruba perplexity went from 41.9 to 7.7. Welsh from 102.7 to 22.1. The MoE matched each specialist's performance on its own language simultaneously. Nobody shared any data.

I also ran a 20-contributor experiment (10 languages + 10 domains) and got +16.71% over the best specialist. The router figured out on its own that medical and chemistry text should cross-route 60/40 — nobody told it those domains overlap.
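One way to surface routing patterns like that medical/chemistry overlap is to average the router's gate weights per domain over a held-out set. A post-hoc analysis sketch (my own illustration, not code from the repo; the example numbers in the test are fabricated for shape-checking only):

```python
import numpy as np

def mean_routing_by_domain(gates, domains):
    """Average router gate weights per evaluation domain.

    gates   : (n_examples, n_specialists) gate weights, one row per example
    domains : list of n_examples domain labels
    returns : dict mapping domain -> mean gate vector, i.e. how much
              probability mass each specialist receives on that domain
    """
    out = {}
    for d in set(domains):
        mask = np.array([label == d for label in domains])
        out[d] = gates[mask].mean(axis=0)
    return out
```

A cross-routing pattern like 60/40 would show up here as a domain whose mean gate vector puts substantial mass on another domain's specialist.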

Some honest limitations:

- Inference cost scales linearly with number of specialists (you run all of them)

- Haven't tested above 6.9B

- The predictive formula is based on 6 data points — useful as a heuristic, not a universal law

- LoRA doesn't work for this — you need full fine-tuning of unfrozen layers

**Where I could use help:**

I'm targeting NeurIPS 2026 with this and would love independent validation from folks with different hardware setups. The experiment is pretty self-contained:

  1. Pick a Pythia checkpoint (410M is cheapest, runs on consumer GPUs in under an hour)

  2. Fine-tune 3 specialists on different domains for 2,000 steps each

  3. Train the router for 500 steps on mixed data

  4. Compare fused model vs. best individual specialist on held-out eval
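For step 4, note that the headline number is relative improvement over the *strongest* specialist, not the average. A tiny helper to make the comparison unambiguous (my own convention; check the repo's eval scripts for the exact metric):

```python
def relative_gain_pct(fused_acc, specialist_accs):
    """Percent accuracy improvement of the fused model over the
    best individual specialist on the held-out eval."""
    best = max(specialist_accs)
    return 100.0 * (fused_acc - best) / best
```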

Everything you need is in the GitHub repo. If you can reproduce the ~+7% gain at 410M, that would be incredibly valuable; trying it at scales I haven't tested (13B+) would be even better. I'll credit any independent results that make it into the paper.

If you work with under-resourced languages or have domain-specific data you can't share publicly, this protocol was designed for exactly that situation.

The name is KALAVAI (கலவை) — Tamil for fusion/mixing. Built at Murai Labs.

Happy to answer any questions about the setup, the results, or the failure modes.

submitted by /u/No_Gap_4296