
HoloByte: Continuous Hyperspherical Distillation for Tokenizer-Free Modeling

arXiv cs.LG / 3/19/2026


Key Points

  • HoloByte introduces a tokenizer-free framework that replaces discrete tokenization with continuous hyperspherical representations for sequence modeling.
  • It projects fixed-capacity byte chunks into a continuous hypersphere using an invertible rotation, enabling a transformer to operate on compressed representations and reducing exact attention complexity.
  • A localized micro-decoder recovers exact byte-level distributions, and a dual-objective loss incorporating a Holographic Latent Mean Squared Error bounds the gradient to guarantee stability; the paper also derives a theoretical embedding dimension lower bound D = Ω(W log |V|) for exact recovery.
  • Empirically, under strictly matched parameter constraints, HoloByte outperforms a comparable discrete BPE baseline, demonstrating the viability of vocabulary-invariant sequence modeling, with code publicly available.
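The chunk-and-rotate idea in the key points can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the chunk width W, the per-byte angular pre-embedding, and the QR-based rotation are all assumptions chosen so that the map lands on the unit hypersphere and is exactly invertible, mirroring the paper's stated properties.

```python
import numpy as np

W = 4      # assumed chunk width (bytes per chunk)
V = 256    # byte vocabulary size
D = 2 * W  # each byte occupies 2 dims (an angle), so the rotation is square

rng = np.random.default_rng(0)
# Fixed orthogonal rotation: Q @ Q.T = I, so the projection is invertible
# and norm-preserving (a stand-in for the paper's rotation operator).
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))

def embed_chunk(chunk):
    """Map W bytes to a single point on the unit hypersphere."""
    angles = 2 * np.pi * np.asarray(chunk, dtype=np.float64) / V
    # Each byte becomes a unit 2-D point; scaling by 1/sqrt(W) makes
    # the concatenated vector have norm exactly 1 before rotation.
    x = np.stack([np.cos(angles), np.sin(angles)], axis=1).ravel() / np.sqrt(W)
    return Q @ x

def recover_chunk(z):
    """Invert the rotation and read the bytes back off exactly."""
    x = (Q.T @ z).reshape(W, 2) * np.sqrt(W)
    angles = np.arctan2(x[:, 1], x[:, 0]) % (2 * np.pi)
    return np.round(angles * V / (2 * np.pi)).astype(int) % V
```

Because the rotation is orthogonal, the round trip is lossless: `recover_chunk(embed_chunk(b"Holo"))` returns the original four byte values, and every embedded chunk lies on the unit sphere.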

Abstract

Sequence modeling universally relies on discrete subword tokenization to circumvent the \mathcal{O}(N^2) computational intractability of native byte-level attention. However, this heuristic quantization imposes artificial morphological boundaries, enforces vocabulary dependence, and fractures the continuity of the optimization landscape. To resolve this dichotomy, we introduce HoloByte: a strictly tokenizer-free framework utilizing Continuous Hyperspherical Distillation. HoloByte partitions discrete byte sequences into fixed-capacity chunks and projects them into a continuous, strictly bounded hyperspherical manifold via an invertible, dimension-preserving orthogonal rotation operator. This spatial superposition allows a macroscopic transformer to operate exclusively on compressed continuous representations, formally reducing the exact attention time complexity from \mathcal{O}(N^2D) to \mathcal{O}\left( \frac{N^2}{W^2}D + ND^2 \right). A localized causal micro-decoder subsequently unbinds these representations to compute exact byte-level distributions. To govern this continuous trajectory, we propose a dual-objective formulation incorporating a mathematically precise Holographic Latent Mean Squared Error, which strictly bounds the gradient and guarantees asymptotic stability. Theoretically, we derive the minimal embedding dimension D = \Omega(W \ln |\mathcal{V}|) required to ensure error-free discrete recovery from the continuous manifold. Empirically, under strictly matched parameter constraints, HoloByte systematically outperforms a comparable discrete Byte-Pair Encoding (BPE) baseline. These results establish Continuous Hyperspherical Distillation as a mathematically rigorous and computationally tractable foundation for vocabulary-invariant sequence modeling. The code is available at https://github.com/VladimerKhasia/HoloByte.
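The claimed complexity reduction and the dimension bound are easy to sanity-check numerically. The values of N, D, and W below are illustrative choices, not numbers from the paper:

```python
import math

N, D, W = 8192, 512, 8  # illustrative sequence length, embed dim, chunk width
V = 256                 # byte vocabulary size

# O(N^2 D): exact attention directly over bytes.
byte_attn = N**2 * D
# O(N^2/W^2 D + N D^2): attention over N/W chunk representations,
# plus the per-position projection cost.
holo_attn = (N**2 // W**2) * D + N * D**2

print(byte_attn / holo_attn)  # → 12.8, i.e. ~13x fewer operations

# D = Omega(W ln |V|): minimal embedding dimension for exact recovery.
d_min = W * math.log(V)
print(d_min)  # ≈ 44.36, so D = 512 comfortably exceeds the bound for W = 8
```

Note that for large D the N D^2 projection term dominates, so the chunk width W mainly pays off when N is large relative to D, which matches the byte-level long-sequence setting the paper targets.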