HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

arXiv cs.LG / 4/27/2026

📰 News · Models & Research

Key Points

  • HubRouter is a new pluggable routing primitive that replaces O(n^2) attention with an O(nM) hub-mediated mechanism using a small set of learned hub tokens (M << n).
  • It uses an encode–decode–score–council pipeline where hub tokens attend to all tokens, tokens compute routing fingerprints against the hubs, a score head selects the top-k tokens, and a sparse council attends only to the selected subset (a code sketch follows this list).
  • The paper evaluates HubRouter in two from-scratch architectures (a Jamba-style hybrid and a 12-layer Transformer); retrofitting the module into pretrained models was tested and reported as a negative case.
  • Results show modest perplexity gains or tradeoffs depending on the setup: Hub-Jamba gives a nominal ~4.2% PPL improvement alongside large training-throughput gains, graduated replacement of 25% of Transformer attention layers improves matched-budget perplexity, and Hub-GPT is strictly causal but pays a measurable quality cost (~3 PPL) to avoid O(n^2) compute.
  • A multi-seed hub-count sweep identifies M=8–14 as the reliably converging range, with M>=20 showing increasing sensitivity across random seeds; the authors plan to release code and scripts.
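
To make the four-stage pipeline above concrete, here is a minimal PyTorch sketch under assumed shapes and module choices; the class and parameter names (HubRouterSketch, num_hubs, top_k, score_head) are illustrative and not taken from the paper, and causal masking plus the council's exact structure are simplified.

```python
# Minimal sketch of hub-mediated routing: encode -> decode -> score -> council.
# Not the authors' implementation; module names and hyperparameters are assumed.
import torch
import torch.nn as nn

class HubRouterSketch(nn.Module):
    def __init__(self, d_model: int, num_hubs: int = 12, top_k: int = 64):
        super().__init__()
        self.hubs = nn.Parameter(torch.randn(num_hubs, d_model) * 0.02)  # M learned hub tokens
        self.encode = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(num_hubs, 1)  # turns a token's hub fingerprint into a routing score
        self.council = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        B, n, d = x.shape
        hubs = self.hubs.unsqueeze(0).expand(B, -1, -1)                  # (B, M, d)

        # 1) Encode: M hubs cross-attend to all n tokens -> O(n*M) pairs, not O(n^2).
        hub_summaries, _ = self.encode(query=hubs, key=x, value=x)       # (B, M, d)

        # 2) Decode: each token's routing fingerprint = scaled similarity to every hub.
        fingerprints = torch.einsum('bnd,bmd->bnm', x, hub_summaries) / d ** 0.5  # (B, n, M)

        # 3) Score: select the top-k tokens by routing score.
        scores = self.score_head(fingerprints).squeeze(-1)               # (B, n)
        k = min(self.top_k, n)
        top_idx = scores.topk(k, dim=-1).indices                         # (B, k)

        # 4) Council: all tokens attend only to the selected subset (sparse attention).
        selected = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))   # (B, k, d)
        out, _ = self.council(query=x, key=selected, value=selected)             # (B, n, d)
        return out

# Dummy usage: shapes in, shapes out.
x = torch.randn(2, 128, 256)                 # (batch=2, n=128, d_model=256)
y = HubRouterSketch(d_model=256)(x)          # (2, 128, 256)
```

In this sketch the encode and council steps cost roughly O(nM) and O(nk) attention pairs respectively, which is where the sub-quadratic scaling comes from.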

Abstract

We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures, a Jamba-style hybrid and a 12-layer Transformer; retrofitting into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs to form routing fingerprints, a score head selects the top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 against matched PyTorch-native baselines; an optimized baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 for the pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (after the council-causal fix), approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak found during adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) identifies M=8-14 as the reliably converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. A companion paper, arXiv:2603.20997 (Basu, 2026), defines the routing diagnostic task. Code and scripts will be released.
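
To make the O(n^2)-versus-O(nM) contrast concrete, the back-of-the-envelope comparison below counts attention score pairs for full attention and for hub routing at a few sequence lengths. M=12 sits inside the paper's reported reliable band, but the top-k value of 64 is purely an assumed illustration, and pair counts are not measured FLOPs or throughput.

```python
# Rough pair-count comparison (not measured throughput) between full O(n^2)
# attention and hub-mediated routing; M=12 is within the paper's M=8-14 band,
# while top-k=64 is an assumed value used only for illustration.
def pair_counts(n: int, m: int = 12, k: int = 64) -> dict:
    return {
        "full_attention": n * n,      # every token attends to every token
        "hub_routing": (n * m)        # encode: hubs attend to all tokens
                       + (n * m)      # fingerprints: tokens scored against hubs
                       + (n * k),     # council: tokens attend to the top-k subset
    }

for n in (1024, 4096, 16384):
    c = pair_counts(n)
    print(f"n={n:6d}  full={c['full_attention']:>12,}  "
          f"hub={c['hub_routing']:>10,}  ratio ~{c['full_attention'] / c['hub_routing']:.0f}x")
```

At n=1024 this crude pair-count ratio comes out near 12x, roughly in line with the abstract's estimate of ~10-15x against an optimized baseline, and it grows with sequence length as the quadratic term dominates.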