Parallel Prefix Verification for Speculative Generation

arXiv cs.AI / 5/7/2026


Key Points

  • The paper introduces PARSE (PArallel pRefix Speculative Engine), a speculative decoding framework that speeds up LLM inference by performing prefix verification at a semantic/segment level rather than purely at the token level.
  • PARSE overcomes the overhead of prior semantic-segment verification by using parallel prefix verification in a single forward pass, leveraging a custom attention mask to find the maximal valid prefix without sequential checks.
  • The method is designed to be orthogonal to existing token-level speculative decoding, so it can be combined with other techniques for additive performance improvements.
  • Experiments across multiple models and benchmarks show throughput gains of about 1.25× to 4.3× versus the target model alone, and 1.6× to 4.5× when composed with EAGLE-3, with negligible accuracy loss.
  • Overall, the results position parallel prefix verification as a general, compute-efficient approach for accelerating LLM inference.
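To make the mechanism concrete, here is a minimal sketch of the two ingredients described above: a mask in which each verification slot attends only to the draft tokens of its prefix, and the selection of the maximal valid prefix from per-segment accept flags. The mask construction is a hypothetical illustration of the idea, not the paper's actual implementation.

```python
import numpy as np

def prefix_verification_mask(seg_lens):
    """Build a boolean attention mask with one verification slot per
    segment prefix. Slot i may attend only to draft tokens in segments
    0..i, so all prefixes can be checked in one forward pass.
    (Illustrative construction; PARSE's actual mask may differ.)"""
    n = int(sum(seg_lens))            # total draft tokens
    k = len(seg_lens)                 # number of segment prefixes
    mask = np.zeros((k, n), dtype=bool)
    bounds = np.cumsum(seg_lens)      # end index of each prefix
    for i, b in enumerate(bounds):
        mask[i, :b] = True            # slot i sees segments 0..i only
    return mask

def maximal_valid_prefix(accept):
    """Length of the longest run of accepted segments from the start."""
    m = 0
    for ok in accept:
        if not ok:
            break
        m += 1
    return m
```

For example, `prefix_verification_mask([2, 3])` yields a 2×5 mask whose first row exposes only the first two draft tokens, while `maximal_valid_prefix([True, True, False, True])` returns 2: the third segment fails, so later accepted segments cannot be kept.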

Abstract

We introduce PARSE (PArallel pRefix Speculative Engine), a speculative generation framework that accelerates large language model (LLM) inference by parallelizing prefix verification on a semantic level. Existing speculative decoding methods are fundamentally limited by token-level equivalence: the target model must verify each token, leading to short acceptance lengths and modest speedups. Moving to semantic or segment-level verification can substantially increase acceptance granularity, but prior approaches rely on sequential verification, introducing significant overhead and limiting practical gains. PARSE introduces parallel prefix verification, enabling semantic-level verification without sequential checks. Given a full draft from a draft model, the target model evaluates correctness across multiple prefixes in a single forward pass using a custom attention mask, directly identifying the maximal valid prefix. This eliminates sequential segment verification and makes verification compute-efficient. PARSE is orthogonal to token-level speculative decoding and can be composed with it for additional gains. Across models and benchmarks, PARSE delivers 1.25× to 4.3× throughput gains over the target model, and 1.6× to 4.5× when composed with EAGLE-3, all with negligible accuracy degradation. This demonstrates parallel prefix verification as an effective, general approach to accelerating LLM inference.
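The decoding loop the abstract describes, draft a full continuation, verify all prefixes at once, keep the maximal valid prefix, can be sketched as a single iteration. The function names, signatures, and token representation below are assumptions for illustration, not the paper's API:

```python
def parse_step(draft_segments, verify_prefixes, context):
    """One PARSE-style iteration (illustrative sketch under assumed
    interfaces; not the paper's implementation)."""
    # 1. The draft model proposes a full continuation, pre-split into
    #    semantic segments (each segment is a list of tokens).
    segments = draft_segments(context)
    # 2. The target model scores every segment prefix in one forward
    #    pass, returning one accept/reject flag per prefix.
    accept = verify_prefixes(context, segments)
    # 3. Keep the maximal valid prefix: the longest run of accepted
    #    segments from the start of the draft.
    m = 0
    while m < len(accept) and accept[m]:
        m += 1
    new_context = context + [tok for seg in segments[:m] for tok in seg]
    return new_context, m
```

Because the verification interface only consumes the draft, this step composes naturally with a token-level speculator (such as EAGLE-3) inside `draft_segments`, which is the orthogonality the abstract claims.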