Designed a photonic chip for O(1) KV cache block selection — 944x faster, 18,000x less energy than GPU scan at 1M context
Reddit r/LocalLLaMA / 3/23/2026

I’m a nanophotonics PhD student, and I think photonic chips can solve the KV cache scanning bottleneck. Block-sparse methods like Quest/RocketKV reduce the number of blocks fetched, but they still scan all N block signatures from HBM every decode step. That scan is O(N) — at 1M context on an H100, it costs ~8.5μs per query, and in batch serving it becomes the dominant cost. PRISM replaces the scan with optical broadcast: the query is encoded as light, split to all N blocks simultaneously via a passive splitter, each block’s signature is stored as MRR weights, and all similarity scores are computed at once. That is O(1) regardless of N. At 1M context: 944x faster selection, 18,000x less energy. At 100M: 5.3x faster total decode than Quest (batch=128, Qwen2.5-7B). No fabricated chip — the photonic numbers come from device-physics simulation on TFLN; the GPU scan benchmarks are real measurements. The repo includes a GPU-only block selector that works today (100% needle retrieval, 0% LongBench-v2 drop). Code + paper: https://github.com/hyoseokp/PRISM
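The O(N) digital baseline the post describes can be sketched in a few lines. This is an illustrative Quest-style selector, not the PRISM repo's actual code: block size, signature shape, and k are my assumptions, and the per-channel min/max upper-bound criterion is the one Quest uses.

```python
import numpy as np

def select_blocks(query, block_max, block_min, k=8):
    """Scan all N block signatures (the O(N) step PRISM replaces):
    upper-bound each block's attention score from its per-channel
    min/max key signature, then keep the top-k blocks."""
    # For each channel, take whichever signature extreme maximizes
    # q_i * key_i; summing gives an upper bound on q . k per block.
    upper = np.maximum(query * block_max, query * block_min).sum(axis=-1)
    return np.argsort(upper)[::-1][:k]  # indices of the k best blocks

rng = np.random.default_rng(0)
n_blocks, dim = 65536, 128            # ~1M tokens at 16 tokens/block
q = rng.standard_normal(dim).astype(np.float32)
bmax = rng.standard_normal((n_blocks, dim)).astype(np.float32)
bmin = bmax - np.abs(rng.standard_normal((n_blocks, dim))).astype(np.float32)

top = select_blocks(q, bmax, bmin, k=8)
```

Every decode step touches all `n_blocks` signatures, which is exactly the memory traffic the optical broadcast computes in constant time instead.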
💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research
Key Points
- The post argues that KV-cache block selection in long-context LLMs remains an O(N) bottleneck because GPUs still scan all N block signatures from HBM at every decode step.
- It presents PRISM, a photonic-chip concept that replaces the digital scan with an optical broadcast where the query is sent as light and passive splitters distribute it to all blocks in parallel for simultaneous similarity computation.
- The approach claims O(1) selection cost with respect to context length N by computing similarity scores for all blocks at once, using stored signature weights.
- Reported results include ~944× faster block selection and ~18,000× lower energy than GPU scanning at 1M context, and a stated ~5.3× faster total decode versus Quest at 100M context (batch=128, Qwen2.5-7B).
- The author notes the photonic performance is based on device-physics simulation (no fabricated chip yet), while the repo includes a working GPU-only block selector for evaluation today.
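The linear-in-N claim in the first key point follows from a simple bandwidth argument: every decode step must stream all N block signatures from HBM. Below is a back-of-envelope model using my own assumed parameters (16-token blocks, 512-byte signatures, ~3.35 TB/s HBM3); the ~8.5μs figure in the post comes from real GPU measurements, so this only illustrates the scaling, not the exact number.

```python
def scan_time_us(context_tokens, block_size=16, sig_bytes=512,
                 hbm_bw_bytes_per_s=3.35e12):
    """Time to stream all block signatures from HBM once, in microseconds.
    Linear in context length: double the tokens, double the scan."""
    n_blocks = context_tokens // block_size
    return n_blocks * sig_bytes / hbm_bw_bytes_per_s * 1e6

for ctx in (1_000_000, 10_000_000, 100_000_000):
    print(f"{ctx:>11,} tokens -> ~{scan_time_us(ctx):9.1f} us per scan")
```

The optical broadcast sidesteps this entirely: signatures live in MRR weights on-chip, so selection cost does not depend on how many blocks the splitter fans out to.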