Component-Aware Self-Speculative Decoding in Hybrid Language Models

arXiv cs.CL · May 5, 2026


Key Points

  • The paper introduces “component-aware self-speculative decoding,” a speculative decoding method tailored to hybrid language models by using an internal SSM/linear-attention subgraph as a zero-cost draft.
  • Experiments on Falcon-H1 and Qwen3.5 hybrid architectures show a major acceptance-rate gap: parallel hybrids achieve α≈0.68 (k=2) under greedy decoding, while sequential hybrids perform poorly (α≈0.038), implying architecture-specific integration matters.
  • The approach is reported to be scale-invariant for Falcon-H1, where a 3B model reproduces acceptance rates seen at 0.5B.
  • The authors also find that an ablation-based perplexity degradation ratio predicts speculative viability without running speculative decoding, mapping to higher α for Falcon than for Qwen (e.g., α≈0.37 at k=4 for Falcon vs. α≈0.019 for Qwen).
  • For sequential hybrids, the paper reports that a generic LayerSkip strategy can outperform the component-aware method by about 12×, suggesting the optimal strategy depends on how hybrid components are composed.
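To see why the acceptance-rate gap matters so much in practice, the standard speculative-decoding analysis is useful: if each of the k drafted tokens is accepted independently with probability α, the expected number of tokens emitted per draft-then-verify step is (1 − α^(k+1))/(1 − α). The α and k values below are the paper's headline numbers; the formula itself is the usual one from the speculative-sampling literature, not something derived in this summary, and the i.i.d. acceptance assumption is a simplification.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step, assuming each of
    the k drafted tokens is accepted independently with probability alpha
    (the verifier always contributes one extra token, so the result is >= 1)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Parallel hybrid (Falcon-H1-style): alpha ~ 0.68 at k = 2
print(round(expected_tokens_per_step(0.68, 2), 3))   # ~2.142 tokens/step
# Sequential hybrid (Qwen3.5-style): alpha ~ 0.038 at k = 2
print(round(expected_tokens_per_step(0.038, 2), 3))  # ~1.039 tokens/step
```

Under this idealized model, the parallel hybrid's α ≈ 0.68 roughly doubles per-step throughput, while α ≈ 0.038 yields almost no gain over plain autoregressive decoding.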

Abstract

Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of α = 0.68 at draft length k = 2 under greedy decoding, while sequential hybrids yield only α = 0.038 -- an 18× gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15× ratio (Falcon) maps to α = 0.37 at k = 4, while 81.96× (Qwen) maps to α = 0.019. For sequential hybrids, generic LayerSkip achieves 12× higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models -- not merely the presence of alternative components -- determines whether component-level self-speculation is viable.
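The draft-then-verify loop the abstract describes can be sketched in a few lines. This is a toy illustration under greedy decoding, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the cheap internal component and the full hybrid model, and a real system would verify all drafted positions in a single parallel forward pass rather than one call per token.

```python
def speculative_greedy_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step. Drafts k tokens with the cheap model,
    accepts the longest prefix matching the target's greedy choices, then
    appends one token from the target (guaranteeing progress >= 1 token)."""
    # Draft k tokens autoregressively with the cheap component.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # Verify: under greedy decoding, a drafted token is accepted iff it
    # equals the target's argmax at that position. (Done sequentially here
    # for clarity; real implementations batch this into one forward pass.)
    ctx = list(prefix)
    accepted = 0
    for tok in drafted:
        if target_next(ctx) != tok:
            break
        ctx.append(tok)
        accepted += 1
    # The target contributes one token at the first mismatch,
    # or one bonus token after a fully accepted draft.
    ctx.append(target_next(ctx))
    return ctx, accepted

# Toy next-token functions over an integer "vocabulary":
draft = lambda ctx: (len(ctx) * 2) % 7
seq, n_acc = speculative_greedy_step(draft, draft, [1, 2], k=4)
print(n_acc)  # 4 -- a perfectly aligned draft has every token accepted
```

When draft and target agree (as in the toy call above), all k tokens are accepted per step; when they diverge early, as the paper reports for sequential hybrids, most steps emit only the single verifier token and the speedup collapses.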