AI Navigate

Zero text between my agents – latent transfer now works cross-model

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • AVP lets two agents communicate through latent states (KV-cache) instead of text. It now works cross-model (Qwen ↔ Llama), was benchmarked with Qwen2.5-7B on an A100, and has a Colab notebook to try it.
  • Benchmarks across HumanEval, GSM8K, DebugBench, MATH, and HotpotQA show latent AVP matching or beating a text chain on accuracy (HumanEval 67.1% vs 53.0%, GSM8K 90.5% vs 87.0%, DebugBench 51.0% vs 49.0%, HotpotQA 52.5% vs 50.5%), with end-to-end speedups of up to 5.8x.
  • The approach currently requires HuggingFace Transformers + GPU and does not support llama.cpp, Ollama, or cloud APIs; quantized models are untested and vLLM latent support is planned next.
  • Code generation accuracy improves by about 14.1 percentage points with latent transfer, and experiments with Llama 3.2-3B show the same pattern; end-to-end speedups in typical pipelines are 2-3x, with larger gains in isolated decoding steps.

I posted about AVP here a few weeks ago – agents passing KV-cache to each other instead of text. Good discussion, a lot of questions about what benchmarks I actually used and how prefix caching fits in.

Since then, I ran proper benchmarks on A100 (HumanEval, GSM8K, MATH, DebugBench, HotpotQA – n=164-500), got cross-model working, and made a Colab notebook so you can actually try it (free T4, ~8 min).

Heads up – this only works with HuggingFace Transformers + GPU right now. No llama.cpp, no Ollama, no cloud APIs. It needs direct access to model internals. Quantized models untested. vLLM latent support is what I'm working on next. If that's not your stack, the results below at least show where this is going.

Same model, 2 agents (Qwen2.5-7B, A100, seed=42, T=0.7)

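| Benchmark  | n   | Latent (AVP) | Text Chain | Speedup |
|------------|-----|--------------|------------|---------|
| HumanEval  | 164 | 67.1%        | 53.0%      | 1.2x    |
| GSM8K      | 200 | 90.5%        | 87.0%      | 2.0x    |
| DebugBench | 100 | 51.0%        | 49.0%      | 3.0x    |
| MATH       | 500 | 66.8%        | 66.6%      |         |
| HotpotQA   | 200 | 52.5%        | 50.5%      | 5.8x    |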

The code generation result surprised me – +14.1pp over text chain (p=0.004, McNemar's). I ran 4 more seeds at T=0.01 to make sure: 70.0%±0.3% latent vs 57.6%±0.3% text. Gap holds at both temperatures. Also checked on Llama 3.2-3B – same pattern (54.3% latent vs 44.5% text). GSM8K across 3 seeds is neutral, everything else p>0.1.
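For anyone who wants to run the same kind of significance check on their own runs: McNemar's test only looks at the problems where the two pipelines disagree (one passes, the other fails). Below is a minimal sketch of the exact two-sided version. The function name and the example counts are hypothetical, chosen for illustration; they are not the per-problem counts behind the p=0.004 above.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = problems only pipeline A solved, c = problems only pipeline B solved.
    Problems both solved or both failed carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0  # the pipelines never disagree
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip:
    # sum the binomial tail up to the smaller count, then double it.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, NOT taken from the benchmark above.
p_value = mcnemar_exact(b=1, c=9)
print(f"p = {p_value:.4f}")
```

The test is paired, so both pipelines need to be scored on the same problems with the same seed for the counts to mean anything.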

So, code generation gets a real accuracy boost, everything else stays the same but runs 2-6x faster. I'll take that.

One thing to be honest about – these are single-request numbers, not production throughput. With vLLM continuous batching the GPU is already saturated across requests, so the speedup story would look different. The 2-3x is real for sequential HuggingFace pipelines.

Where the speed comes from: Agent A's 20 latent steps run in 0.9s vs 15.6s to decode text – that's 17x. But Agent B still has to decode its own answer (~5.5s either way), so end-to-end you get 2-3x, not 17x. Amdahl's law.
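The arithmetic behind that, as a small sketch using the rounded timings quoted above (the function name is just for illustration):

```python
def end_to_end_speedup(think_text_s: float, think_latent_s: float, decode_s: float) -> float:
    """Speedup of the whole two-agent pipeline when only Agent A's step gets faster."""
    baseline = think_text_s + decode_s   # Agent A decodes text, then Agent B answers
    latent = think_latent_s + decode_s   # Agent A runs latent steps, then Agent B answers
    return baseline / latent


stage = 15.6 / 0.9                           # Agent A's step in isolation
total = end_to_end_speedup(15.6, 0.9, 5.5)   # Agent B's ~5.5s decode is unchanged
print(f"stage speedup: {stage:.1f}x, end-to-end: {total:.1f}x")
# prints: stage speedup: 17.3x, end-to-end: 3.3x
```

Agent B's decode time is the part that doesn't shrink, so it caps the overall gain no matter how fast the latent step gets.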

Built on top of LatentMAS which proved same-model latent communication works.

Cross-model

Different models can now share hidden states. Zero training, zero learned parameters. Cross-model is opt-in – you pass cross_model=True and a source= connector, otherwise communication falls back to text mode.

You project one model's last hidden state through shared vocabulary into the other model's space. Qwen and Llama share about 85% of their BPE tokens (exact byte-level match) – tokens like "return", "function", "+=". So: source model thinks -> extract hidden state -> project through source output head -> softmax over shared tokens -> project through target input embeddings -> inject. The whole thing is ~100 lines, zero learned parameters. The projection technique itself isn't new (cross-lingual embeddings use the same idea), but I haven't seen it used for cross-model agent communication before.
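The pipeline in that paragraph can be sketched on toy data like this. It is a pure-Python illustration under stated assumptions, not the AVP code: the function name, the tiny vocabularies, and the matrices are all made up, and a real version would use the models' actual output head and input embedding tensors.

```python
import math


def project_hidden_state(h_src, W_out_src, E_in_tgt, src_vocab, tgt_vocab):
    """Map a source-model hidden state into the target model's embedding space.

    h_src:     source hidden state, length d_src
    W_out_src: source output head, one row of length d_src per source token id
    E_in_tgt:  target input embeddings, one row of length d_tgt per target token id
    *_vocab:   dicts mapping token string -> token id
    """
    # Tokens whose strings match exactly in both vocabularies.
    shared = sorted(set(src_vocab) & set(tgt_vocab))
    if not shared:
        raise ValueError("the two vocabularies share no tokens")
    # Source output head, restricted to shared tokens: one logit each.
    logits = [sum(w * x for w, x in zip(W_out_src[src_vocab[t]], h_src)) for t in shared]
    # Numerically stable softmax over the shared tokens only.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Probability-weighted mix of the target model's input embeddings.
    d_tgt = len(E_in_tgt[0])
    out = [0.0] * d_tgt
    for t, p in zip(shared, probs):
        row = E_in_tgt[tgt_vocab[t]]
        for j in range(d_tgt):
            out[j] += p * row[j]
    return out  # length d_tgt, ready to inject as an input embedding


# Toy setup: two shared tokens with different ids in each model, plus one unshared token each.
src_vocab = {"return": 0, "function": 1, "src_only": 2}
tgt_vocab = {"function": 0, "return": 1, "tgt_only": 2}
W_out_src = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]               # d_src = 2
E_in_tgt = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # d_tgt = 3

h = [3.0, 0.0]  # a hidden state that strongly "means" the token "return"
projected = project_hidden_state(h, W_out_src, E_in_tgt, src_vocab, tgt_vocab)
```

In this toy case the hidden state points almost entirely at "return", so the projected vector lands on the target model's embedding for "return", even though the token has a different id there. Nothing is learned: the only bridge is the set of token strings both vocabularies contain.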

Same-family (Qwen 7B -> Qwen 3B, shared tokenizer) – projection doesn't break anything. GSM8K: 82.5% with rosetta (the projected latent path) vs 82.5% for the 3B on its own. HumanEval: 66.5% rosetta vs 61.0% direct, but the CIs overlap, so it could be noise.

Cross-family (Qwen ↔ Llama, single seed=42, T=0.7, A100):

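| Direction          | GSM8K Rosetta | GSM8K Text | HumanEval Rosetta | HumanEval Text |
|--------------------|---------------|------------|-------------------|----------------|
| Qwen 7B → Llama 3B | 77.0%         | 86.5%      | 47.0%             | 57.9%          |
| Llama 3B → Qwen 7B | 90.0%         | 82.0%      | 79.3%             | 61.6%          |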

The direction pattern is interesting. When the weaker model solves, text wins – it needs the explicit reasoning. Flip it around and rosetta wins big (GSM8K +8pp, HumanEval +17.7pp). A strong solver can work with a reasoning direction; a weak solver needs the full explanation spelled out.

Solo baselines for reference (GSM8K / HumanEval): Qwen 7B = 91.0% / 58.5%, Llama 3B = 76.0% / 50.6%.

When would you actually use this? If you're running different models for different roles and don't want to serialize everything to text between them. Or if your VRAM budget fits a 3B and 7B together but not two 7Bs.

Cross-model needs both models loaded (~20 GB for 7B+3B). No extra VRAM for latent vs text beyond that.

Where it breaks

Cross-model comprehension is bad – HotpotQA gets 7.5%. A single hidden state can carry "solve this math problem this way" but it can't carry paragraph-level facts (names, dates, multi-hop stuff). I spent a lot of time trying to fix this – multi-embedding, discrete tokens, trained translators up to 29M params, hybrid approaches. 9 attempts, nothing worked. The problem is inputs_embeds injection itself, not the projection.

Fan-out (parallel specialists merging into one agent) also degrades – sequential KV injection from multiple sources confuses the aggregator.

Latent steps: 20 is the sweet spot. 40 gets worse, 80 is garbage. Noise accumulates.

Since it came up last time – prefix caching and AVP solve different problems. Prefix caching reuses KV for identical text. AVP transfers computation between agents with different prompts. You'd use both.

Try it

Colab notebook – free T4, ~8 min, zero setup. Uses Qwen2.5-1.5B on 10 problems. Heads up: at 1.5B all modes are about the same accuracy (text actually wins slightly – typical output is direct 60%, latent 60%, text 70%). The notebook shows zero tokens passing between agents, not the full-scale gains. HumanEval advantage shows up at 7B+.

```python
from avp import HuggingFaceConnector

# Same-model
connector = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
context = connector.think("Analyze: 24 * 17 + 3", steps=20)
answer = connector.generate("Solve step by step: 24 * 17 + 3", context=context)

# Cross-model
researcher = HuggingFaceConnector.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
solver = HuggingFaceConnector.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
ctx = researcher.think("Analyze: 24 * 17 + 3", steps=20)
answer = solver.generate("Solve: 24 * 17 + 3", context=ctx, source=researcher, cross_model=True)
```

No LangChain/CrewAI adapter yet – AVP works at the inference layer. Framework integration is on the roadmap.

Happy to answer questions.

submitted by /u/proggmouse