Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

arXiv cs.LG · April 13, 2026


Key Points

  • The paper introduces the Hierarchical Kernel Transformer (HKT), a multi-scale attention mechanism that uses trainable causal downsampling at L resolution levels and fuses per-level score matrices with learned convex weights.
  • It provides theoretical analysis showing the hierarchical score matrix can form a positive semidefinite kernel under a sufficient condition and that each scale yields a unique symmetric (reciprocal) and antisymmetric (directional) attention decomposition.
  • The authors derive an approximation-error decomposition with interpretable terms, including an explicit non-Gaussian correction and a geometric decay bound as L increases.
  • HKT is proven to strictly subsume both single-head standard attention and causal convolution, while total compute is bounded by 4/3 that of standard attention (1.3125x when L = 3).
  • Experiments across three datasets (ListOps, sequential CIFAR-10, and IMDB character-level sentiment) report consistent accuracy gains over retrained standard attention baselines at ~1.31x overhead.

Abstract

The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125x for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10±0.29% vs 50.33±0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45±0.09% vs 34.01±0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19±0.57% vs 62.72±0.40%, T = 1,024), all at 1.31x overhead.
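The core mechanism can be illustrated with a toy sketch. The paper uses trainable causal downsampling and fuses the per-level score matrices themselves; the version below substitutes fixed average pooling for the downsampler and fuses per-level attention *outputs* instead (score matrices at different levels have different widths), so treat it as a structural illustration, not the authors' method:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_attention(Q, K, V, level_logits):
    """Toy L-level attention: pool K/V by a factor of 2 per level and
    combine the levels with softmax (i.e. learned convex) weights.
    Pooling here is a non-trainable, non-causal stand-in for the
    paper's trainable causal downsampling."""
    T, d = Q.shape
    w = softmax(np.asarray(level_logits))        # convex fusion weights
    out = np.zeros_like(V)
    for level in range(len(level_logits)):
        stride = 2 ** level
        n = T - T % stride                       # trim to a multiple of stride
        Kl = K[:n].reshape(-1, stride, d).mean(axis=1)
        Vl = V[:n].reshape(-1, stride, d).mean(axis=1)
        scores = Q @ Kl.T / np.sqrt(d)           # level-specific score matrix
        out += w[level] * softmax(scores, axis=-1) @ Vl
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
y = multiscale_attention(Q, K, V, level_logits=[0.0, 0.0, 0.0])
print(y.shape)  # (16, 8)
```

With L = 1 and weight 1, the loop reduces to standard single-head attention, consistent with the subsumption claim in result (iv).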