Component-Aware Self-Speculative Decoding in Hybrid Language Models

arXiv cs.CL · May 5, 2026


Key Points

  • The paper introduces “component-aware self-speculative decoding,” a speculative decoding method tailored to hybrid language models by using an internal SSM/linear-attention subgraph as a zero-cost draft.
  • Experiments on Falcon-H1 and Qwen3.5 hybrid architectures show a major acceptance-rate gap: parallel hybrids achieve α≈0.68 (k=2) under greedy decoding, while sequential hybrids perform poorly (α≈0.038), implying architecture-specific integration matters.
  • The approach is reported to be scale-invariant for Falcon-H1, where a 3B model reproduces acceptance rates seen at 0.5B.
  • The authors also find that an ablation-based perplexity degradation ratio predicts speculative viability without running speculative decoding, mapping to higher α for Falcon than for Qwen (e.g., α≈0.37 at k=4 for Falcon vs. α≈0.019 for Qwen).
  • For sequential hybrids, the paper reports that a generic LayerSkip strategy can outperform the component-aware method by about 12×, suggesting the optimal strategy depends on how hybrid components are composed.
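To see why the acceptance-rate gap matters so much in practice, the standard speculative-decoding analysis is useful: if each of the k drafted tokens is accepted independently with probability α, the expected number of tokens emitted per draft-then-verify step is (1 − α^(k+1))/(1 − α). The α and k values below are the paper's headline numbers; the formula itself is the usual one from the speculative-sampling literature, not something derived in this summary, and the i.i.d. acceptance assumption is a simplification.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step, assuming each of
    the k drafted tokens is accepted independently with probability alpha
    (the verifier always contributes one extra token, so the result is >= 1)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Parallel hybrid (Falcon-H1-style): alpha ~ 0.68 at k = 2
print(round(expected_tokens_per_step(0.68, 2), 3))   # ~2.142 tokens/step
# Sequential hybrid (Qwen3.5-style): alpha ~ 0.038 at k = 2
print(round(expected_tokens_per_step(0.038, 2), 3))  # ~1.039 tokens/step
```

Under this idealized model, the parallel hybrid's α ≈ 0.68 roughly doubles per-step throughput, while α ≈ 0.038 yields almost no gain over plain autoregressive decoding.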

Abstract

Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of α = 0.68 at draft length k = 2 under greedy decoding, while sequential hybrids yield only α = 0.038 -- an 18× gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15× ratio (Falcon) maps to α = 0.37 at k = 4, while 81.96× (Qwen) maps to α = 0.019. For sequential hybrids, generic LayerSkip achieves 12× higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models -- not merely the presence of alternative components -- determines whether component-level self-speculation is viable.
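The draft-then-verify loop the abstract describes can be sketched in a few lines. This is a toy illustration under greedy decoding, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the cheap internal component and the full hybrid model, and a real system would verify all drafted positions in a single parallel forward pass rather than one call per token.

```python
def speculative_greedy_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step. Drafts k tokens with the cheap model,
    accepts the longest prefix matching the target's greedy choices, then
    appends one token from the target (guaranteeing progress >= 1 token)."""
    # Draft k tokens autoregressively with the cheap component.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # Verify: under greedy decoding, a drafted token is accepted iff it
    # equals the target's argmax at that position. (Done sequentially here
    # for clarity; real implementations batch this into one forward pass.)
    ctx = list(prefix)
    accepted = 0
    for tok in drafted:
        if target_next(ctx) != tok:
            break
        ctx.append(tok)
        accepted += 1
    # The target contributes one token at the first mismatch,
    # or one bonus token after a fully accepted draft.
    ctx.append(target_next(ctx))
    return ctx, accepted

# Toy next-token functions over an integer "vocabulary":
draft = lambda ctx: (len(ctx) * 2) % 7
seq, n_acc = speculative_greedy_step(draft, draft, [1, 2], k=4)
print(n_acc)  # 4 -- a perfectly aligned draft has every token accepted
```

When draft and target agree (as in the toy call above), all k tokens are accepted per step; when they diverge early, as the paper reports for sequential hybrids, most steps emit only the single verifier token and the speedup collapses.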