Monarch v3: 78% Faster LLM Inference with NES-Inspired KV Paging
TL;DR: We implemented NES-inspired memory paging for the transformer KV cache. On a 1.1B-parameter model, decode throughput improves 78% (17.01 → 30.42 tok/s) with near-zero VRAM overhead. The code is open source and benchmarked for throughput; generation-quality validation is still open.
The Problem
KV cache grows linearly with sequence length. By 4K tokens, most of it sits unused—recent tokens matter far more than old ones, yet we keep everything in VRAM at full precision.
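The linear growth is easy to quantify. A back-of-envelope sketch, using TinyLlama-1.1B's config as I recall it from the model card (22 layers, 4 GQA KV heads of dim 64; verify before relying on these numbers):

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# TinyLlama-1.1B in fp16: 22 layers, 4 GQA KV heads, head_dim 64
per_token = kv_bytes_per_token(22, 4, 64)   # 22528 bytes, ~22 KB per token
print(per_token * 4096 / 2**20)             # → 88.0 (MB at 4K tokens)
```

Roughly 88 MB at 4K tokens, almost all of it attending to tokens that rarely matter again.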
Standard approaches (quantization, pruning, distillation) are invasive. We wanted something simpler: just move the old stuff out of the way.
The Solution: NES-Inspired Paging
Think of it like a Game Boy's memory banking system. The cache is split into a hot region (recent tokens, full precision) and a cold region (older tokens, compressed). As new tokens arrive, old ones get evicted from hot storage and compressed into cold storage. When a token is promoted (high attention weight), it moves back to hot.
Key trade-off: We only compute full attention against the hot window. Cold tokens are only accessed on explicit promotion. This is fundamentally different from standard attention—it assumes that recent tokens dominate, which is true for many tasks but not all.
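The hot/cold split above can be sketched as a small data structure. This is illustrative only: the class and method names (`PagedKVCache`, `promote`, etc.) are hypothetical, not Monarch's actual API, and compression is stubbed out.

```python
from collections import deque

class PagedKVCache:
    """Sketch of the hot/cold KV split (hypothetical API, not Monarch's)."""

    def __init__(self, window_size=512):
        self.window_size = window_size
        self.hot = deque()   # (token_id, kv) pairs, full precision
        self.cold = []       # (token_id, compressed kv) pages

    def _evict(self):
        # Oldest hot entries spill into compressed cold storage
        while len(self.hot) > self.window_size:
            token_id, kv = self.hot.popleft()
            self.cold.append((token_id, self.compress(kv)))

    def append(self, token_id, kv):
        self.hot.append((token_id, kv))
        self._evict()

    def promote(self, index):
        # A high-attention cold token moves back into the hot window
        token_id, packed = self.cold.pop(index)
        self.hot.append((token_id, self.decompress(packed)))
        self._evict()

    @staticmethod
    def compress(kv):
        return kv   # placeholder: real code would 4-bit quantize here

    @staticmethod
    def decompress(packed):
        return packed
```

Note that promotion itself triggers an eviction, which is why the sticky mechanism described below is needed to avoid thrashing.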
Four components work together:

- Windowed Attention (the speedup engine)
  - Attention only over the hot window (default ~512 tokens)
  - Older tokens can still be promoted if they're accessed
  - Assumption: recency is a strong signal for attention
  - Not validated: full generation-quality impact vs. baseline
- TurboQuant Compression (~97% size reduction for cold KV)
  - Quantize cold KV to 4-bit integers
  - Polar encoding (radius + angle bins) for similarity
  - Residual correction (1 bit per value)
  - Decode on access with minimal overhead
- Sliding Window Eviction
  - Recent N tokens stay hot by default
  - Old tokens compress to cold storage
  - No need to know "important" tokens in advance
- Attention-Weighted Promotion
  - High-attention tokens can move back to hot
  - Sticky mechanism prevents thrashing
  - Threshold-based to avoid spurious promotions
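To make the compression step concrete, here is a minimal 4-bit quantizer. This is not the actual TurboQuant scheme (polar radius/angle bins plus the 1-bit residual are more involved); it is a plain symmetric absmax quantizer that shows the decode-on-access pattern.

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric absmax 4-bit quantization (illustrative, not TurboQuant).
    Maps floats to integer codes in [-8, 7] with one shared scale."""
    scale = max(np.abs(x).max() / 7.0, 1e-12)
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale  # real storage would pack two 4-bit codes per byte

def dequantize_4bit(codes, scale):
    """Decode on access: one multiply per value."""
    return codes.astype(np.float32) * scale
```

Reconstruction error is bounded by half the scale, which is why cold (rarely re-read) tokens tolerate it better than hot ones.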
Benchmark Results
Setup: TinyLlama-1.1B fp16, 50 generated tokens, windowed attention enabled
| Mode | Throughput | VRAM | Hot Window |
|---|---|---|---|
| Standard (full attention) | 17.01 tok/s | 2112 MB | — |
| Monarch-v3 (windowed) | 30.42 tok/s | 2131 MB | 512 tokens |
| Gain | +78.7% | +0.9% | — |
The huge speedup comes from computing attention only over recent tokens. The compression saves a little VRAM but isn't the primary win.
Important caveat: This benchmark measures throughput, not generation quality. We haven't validated whether windowed attention + promotion produces text indistinguishable from full attention. The recency assumption works well for many tasks, but may fail on retrieval-heavy or context-dependent queries.
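As a sanity check on the numbers: a 512-token hot window against a 4096-token context cuts attention-score work ~8×, and only the attention share of each decode step gets faster. The 50% attention fraction below is an assumption for illustration, not a measurement from the benchmark.

```python
def end_to_end_speedup(attn_fraction, attn_speedup):
    """Amdahl's law: only the attention share of each decode step speeds up."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

# If roughly half of each decode step were attention-bound (an assumption),
# an 8x attention reduction predicts ~1.78x end to end,
# close to the measured +78.7%.
print(round(end_to_end_speedup(0.5, 4096 / 512), 2))  # → 1.78
```

The fit is suggestive, not proof; profiling the actual attention fraction would pin it down.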
How It Works (Simplified Decode Loop)
```
for step in range(100):
    q = project_query(next_token)

    # Standard: compute attention over ALL cached tokens
    # Monarch: compute attention only over the HOT window
    scores_hot = q @ kv_hot.T  # ~512 tokens instead of 4096+

    # Optional: check whether cold tokens should be promoted
    # (only if attention scores suggest they matter)
    if promotion_enabled and max(scores_hot) < promotion_threshold:
        kv_cold_promoted = decompress(cold_pages)
        scores_cold = q @ kv_cold_promoted.T
        if max(scores_cold) > threshold:
            promote_cold_to_hot()

    # Softmax over [hot + promoted], apply attention

    # Old tokens fall out of the hot window
    if len(kv_hot) > window_size:
        compress_to_cold()
```

The speedup: you skip computing attention for most old tokens. Whether this preserves generation quality is the open question.
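A runnable numpy sketch of one windowed decode step, simplified to a single head with `kv_hot` standing in for both keys and values (and no scaling or promotion):

```python
import numpy as np

def attend(q, kv):
    """Single-head attention of one query over a block of cached KV."""
    scores = q @ kv.T                       # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax
    return weights @ kv                     # weighted sum of values

rng = np.random.default_rng(0)
d, seq_len, window = 8, 64, 16
q = rng.standard_normal(d)
kv = rng.standard_normal((seq_len, d))

out_full = attend(q, kv)            # standard: all 64 cached tokens
out_hot = attend(q, kv[-window:])   # Monarch-style: last 16 tokens only
```

The two outputs agree only when attention mass concentrates inside the window, which is exactly the recency assumption the post flags as unvalidated.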
Current Status
Implementation: Working on Hugging Face Transformers with custom cache backend
Benchmarks: Throughput and VRAM validated on short sequences (longer-sequence and quality benchmarks still pending)
Open Source: Apache 2.0, ready to fork
Paper: Full technical spec (NES-inspired paging, compression schemes, evaluation methodology)
Next: CUDA kernel fusion for cold decompression (would push gains further)
Try It
Clone and run:
```
git clone https://github.com/JohannaWeb/Monarch.git
cd Monarch

# Install deps
pip install -r requirements.txt

# Train TinyLlama on Project Falcon knowledge
python train_tinyllama_fp16.py

# Benchmark standard vs paged inference
python src/benchmark_monarch.py \
    --model models/tinyllama_fp16 \
    --mode both \
    --max-new-tokens 100 \
    --promotion-threshold 0.15 \
    --sticky-threshold 3 \
    --json
```

What We Know & Don't Know
Validated:
- Throughput improvement (+78.7% on short sequences)
- VRAM overhead is minimal (+0.9%)
- Implementation is stable and doesn't crash
Assumed but not validated:
- Generation quality is preserved with windowed attention
- The recency hypothesis holds for diverse tasks
- Gains transfer to longer sequences and larger models
- Promotion mechanism correctly identifies important cold tokens
Not implemented:
- Full BLEU/perplexity evaluation vs. baseline
- Longer sequence benchmarks (>1000 tokens)
- Quality evaluation on retrieval-heavy tasks
- Multi-token batch decoding (single-sequence only)
FAQ
Q: Does windowed attention degrade generation quality?
A: Unknown. We benchmark throughput and VRAM, not output quality. The recency hypothesis is plausible (recent context matters most), but we haven't run BLEU/perplexity benchmarks against baseline. This is a real gap in validation.
Q: What about KV cache quantization papers?
A: We quantize cold tokens, not hot ones. Hot tokens stay full-precision. But the main speedup is from windowed attention, not compression.
Q: What tasks is this good for?
A: Likely: chat, summarization, RAG where recent context dominates. Unlikely: needle-in-haystack retrieval or memory-heavy tasks where old tokens matter.
Q: What about batched inference?
A: Current implementation is single-sequence. Batching requires careful page management (left as future work).
Q: Can I use this with vLLM or SGLang?
A: Not yet. This is a proof-of-concept on standard Transformers. Integration would require those systems to adopt the custom cache backend.
Built by Johanna with Claude (AI pair programming)
Repo: https://github.com/JohannaWeb/Monarch
Paper: See monarch_nes_paper.html in the repo