HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

arXiv cs.LG / 4/27/2026

📰 News · Models & Research

Key Points

  • HubRouter is a new pluggable routing primitive that replaces O(n^2) attention with an O(nM) hub-mediated mechanism using a small set of learned hub tokens (M << n).
  • It uses an encode–decode–score–council pipeline where hub tokens attend to all tokens, tokens compute routing fingerprints against the hubs, a score head selects the top-k tokens, and a sparse council attends only to the selected subset (a code sketch follows this list).
  • The paper evaluates HubRouter in two from-scratch architectures (a Jamba-style hybrid and a 12-layer Transformer); retrofitting the module into pretrained models was tested and reported as a negative case.
  • Results show modest perplexity gains or tradeoffs depending on the setup: Hub-Jamba gives a nominal ~4.2% PPL improvement alongside large training-throughput gains, graduated replacement of 25% of Transformer attention layers improves matched-budget perplexity, and Hub-GPT is strictly causal but pays a measurable quality cost (~3 PPL) to avoid O(n^2) compute.
  • A multi-seed hub-count sweep identifies M=8–14 as the reliably converging range, with M>=20 showing increasing sensitivity across random seeds; the authors plan to release code and scripts.
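
To make the four-stage pipeline above concrete, here is a minimal PyTorch sketch under assumed shapes and module choices; the class and parameter names (HubRouterSketch, num_hubs, top_k, score_head) are illustrative and not taken from the paper, and causal masking plus the council's exact structure are simplified.

```python
# Minimal sketch of hub-mediated routing: encode -> decode -> score -> council.
# Not the authors' implementation; module names and hyperparameters are assumed.
import torch
import torch.nn as nn

class HubRouterSketch(nn.Module):
    def __init__(self, d_model: int, num_hubs: int = 12, top_k: int = 64):
        super().__init__()
        self.hubs = nn.Parameter(torch.randn(num_hubs, d_model) * 0.02)  # M learned hub tokens
        self.encode = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(num_hubs, 1)  # turns a token's hub fingerprint into a routing score
        self.council = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        B, n, d = x.shape
        hubs = self.hubs.unsqueeze(0).expand(B, -1, -1)                  # (B, M, d)

        # 1) Encode: M hubs cross-attend to all n tokens -> O(n*M) pairs, not O(n^2).
        hub_summaries, _ = self.encode(query=hubs, key=x, value=x)       # (B, M, d)

        # 2) Decode: each token's routing fingerprint = scaled similarity to every hub.
        fingerprints = torch.einsum('bnd,bmd->bnm', x, hub_summaries) / d ** 0.5  # (B, n, M)

        # 3) Score: select the top-k tokens by routing score.
        scores = self.score_head(fingerprints).squeeze(-1)               # (B, n)
        k = min(self.top_k, n)
        top_idx = scores.topk(k, dim=-1).indices                         # (B, k)

        # 4) Council: all tokens attend only to the selected subset (sparse attention).
        selected = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))   # (B, k, d)
        out, _ = self.council(query=x, key=selected, value=selected)             # (B, n, d)
        return out

# Dummy usage: shapes in, shapes out.
x = torch.randn(2, 128, 256)                 # (batch=2, n=128, d_model=256)
y = HubRouterSketch(d_model=256)(x)          # (2, 128, 256)
```

In this sketch the encode and council steps cost roughly O(nM) and O(nk) attention pairs respectively, which is where the sub-quadratic scaling comes from.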

Abstract

We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures, a Jamba-style hybrid and a 12-layer Transformer; retrofitting into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs to form routing fingerprints, a score head selects the top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 against matched PyTorch-native baselines; an optimized baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 for the pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (after the council-causal fix), approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak found during adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) identifies M=8-14 as the reliably converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. A companion paper, arXiv:2603.20997 (Basu, 2026), defines the routing diagnostic task. Code and scripts will be released.
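
To make the O(n^2)-versus-O(nM) contrast concrete, the back-of-the-envelope comparison below counts attention score pairs for full attention and for hub routing at a few sequence lengths. M=12 sits inside the paper's reported reliable band, but the top-k value of 64 is purely an assumed illustration, and pair counts are not measured FLOPs or throughput.

```python
# Rough pair-count comparison (not measured throughput) between full O(n^2)
# attention and hub-mediated routing; M=12 is within the paper's M=8-14 band,
# while top-k=64 is an assumed value used only for illustration.
def pair_counts(n: int, m: int = 12, k: int = 64) -> dict:
    return {
        "full_attention": n * n,      # every token attends to every token
        "hub_routing": (n * m)        # encode: hubs attend to all tokens
                       + (n * m)      # fingerprints: tokens scored against hubs
                       + (n * k),     # council: tokens attend to the top-k subset
    }

for n in (1024, 4096, 16384):
    c = pair_counts(n)
    print(f"n={n:6d}  full={c['full_attention']:>12,}  "
          f"hub={c['hub_routing']:>10,}  ratio ~{c['full_attention'] / c['hub_routing']:.0f}x")
```

At n=1024 this crude pair-count ratio comes out near 12x, roughly in line with the abstract's estimate of ~10-15x against an optimized baseline, and it grows with sequence length as the quadratic term dominates.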