Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

arXiv cs.AI / 4/17/2026


Key Points

  • The paper empirically studies how task “cognitive domain” affects acceptance dynamics in tree-based speculative decoding for LLM inference.
  • Using TinyLlama-1.1B as the draft model and Llama-2-7B-Chat-GPTQ as the target, the authors analyze 99,768 speculative nodes from 200 prompts across code generation, mathematical reasoning, logical reasoning, and open-ended chat.
  • Results show that task type predicts acceptance probability more strongly than tree depth, and only the chat domain has consistently expected accepted lengths greater than 1.0 token per step.
  • The study finds that entropy and acceptance are negatively correlated but only weakly so across all domains (rho roughly -0.20 to -0.15); counterintuitively, chat exhibits both the highest entropy and the highest acceptance rate.
  • The findings suggest practical guidance for domain-aware speculation budgets and choosing draft models tailored to the target task type.
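The two headline metrics, per-domain acceptance rate and expected accepted length, can be sketched from per-step acceptance records. This is a minimal illustration, not the paper's code; the domain names and accept/reject outcomes below are invented for demonstration.

```python
# Hypothetical per-step records: for each speculation step, the
# accept/reject outcome of each drafted token along the chosen tree path.
records = {
    "chat": [[True, True, False], [True, False], [True, True, True]],
    "code": [[True, False], [False], [True, False]],
}

def acceptance_rate(paths):
    """Fraction of all drafted nodes that the target model accepts."""
    flat = [a for path in paths for a in path]
    return sum(flat) / len(flat)

def expected_accepted_length(paths):
    """Mean number of leading accepted tokens per speculation step.
    A rejection truncates the path, so only the accepted prefix counts."""
    def prefix(path):
        n = 0
        for accepted in path:
            if not accepted:
                break
            n += 1
        return n
    return sum(prefix(p) for p in paths) / len(paths)

for domain, paths in records.items():
    print(domain,
          round(acceptance_rate(paths), 3),
          round(expected_accepted_length(paths), 3))
```

On these toy records, "chat" clears the 1.0-tokens-per-step bar (expected length 2.0) while "code" does not (about 0.67), mirroring the paper's qualitative finding.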

Abstract

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose a tree of future tokens, which a larger target model then verifies in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies.

Index Terms: speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
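The draft-then-verify loop the abstract describes can be sketched with the standard speculative-sampling acceptance test, in which a drafted token survives with probability min(1, p_target/p_draft). This is a simplified single-path sketch, not the tree-attention implementation the paper studies; function names and the distributions passed in are illustrative.

```python
import random

def accept_token(p_target, p_draft, token, rng=random):
    """Standard speculative-sampling test: keep the drafted token
    with probability min(1, p_target[token] / p_draft[token])."""
    ratio = p_target[token] / p_draft[token]
    return rng.random() < min(1.0, ratio)

def verify_path(path, target_dists, draft_dists, rng=random):
    """Count how many leading drafted tokens on one tree path survive
    verification (done in a single batched target pass in practice).
    The first rejection truncates the speculated continuation."""
    accepted = 0
    for token, p_t, p_d in zip(path, target_dists, draft_dists):
        if not accept_token(p_t, p_d, token, rng):
            break
        accepted += 1
    return accepted
```

Since `random.random()` is always below 1.0, a token the target finds at least as likely as the draft did is always kept, while a token the target assigns zero probability is always rejected; the per-node accept/reject outcomes this yields are exactly what the paper aggregates into acceptance rates and expected accepted lengths.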