ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

arXiv cs.LG / April 17, 2026


Key Points

  • The paper introduces ConfLayers, a new self-speculative decoding method aimed at speeding up large language model text generation without reducing output quality.
  • Unlike prior approaches that train a dedicated layer-skipping policy, ConfLayers builds the draft model through confidence-based skipping of intermediate layers, in a plug-and-play manner.
  • ConfLayers iteratively computes confidence scores for layers, adaptively selects which layers to skip using a threshold that changes between iterations, evaluates the resulting performance, and repeats until improvements stall or an iteration limit is reached.
  • The approach avoids the training overhead and complexity of learning a dedicated layer-skipping policy while maintaining the draft model’s adaptivity to different tasks and datasets.
  • Experiments across multiple models and datasets indicate ConfLayers can achieve up to a 1.4× speedup over standard (vanilla) LLM generation.
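The iterative selection procedure in the bullets above can be sketched as a simple threshold-lowering loop. This is an illustrative reconstruction, not the paper's actual code: `layer_confidences` and `evaluate_skip_set` are hypothetical stand-ins for the per-layer confidence scores and the speed/quality evaluation the paper describes.

```python
def select_skip_layers(layer_confidences, evaluate_skip_set,
                       init_threshold=0.9, decay=0.95, max_iters=10):
    """Adaptively pick layers to skip: skip layers whose confidence
    exceeds the current threshold, keep the best-scoring skip set,
    lower the threshold each round, and stop when improvement stalls
    or the iteration limit is reached."""
    best_set, best_score = set(), evaluate_skip_set(set())
    threshold = init_threshold
    for _ in range(max_iters):
        # Candidate set: every layer confident enough to skip at this threshold.
        candidate = {i for i, c in enumerate(layer_confidences) if c > threshold}
        score = evaluate_skip_set(candidate)
        if score > best_score:
            best_set, best_score = candidate, score
        else:
            break  # no further improvement: stop early
        threshold *= decay  # adapt the threshold for the next round
    return best_set, best_score
```

With a toy evaluator that rewards skipping up to three layers, `select_skip_layers([0.99, 0.5, 0.95, 0.8, 0.97], lambda s: min(len(s), 3))` selects the three high-confidence layers `{0, 2, 4}` and stops once a lower threshold no longer improves the score.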

Abstract

Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, heuristic-based approaches to selecting which layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance of the resulting set, and updates the best selection until no further improvement is achieved or a maximum number of iterations is reached. This framework avoids the overhead and complexity of training a layer-skipping policy and can provide more consistent speed-quality trade-offs while preserving the adaptivity of the draft model to diverse tasks and datasets. Evaluations of ConfLayers across different models and datasets show that our approach offers up to a 1.4× speedup over vanilla LLM generation.
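The draft-then-verify structure the abstract refers to can be illustrated with a minimal token-level sketch. This is a generic speculative-decoding skeleton under simplifying assumptions (greedy decoding, exact-match acceptance); `draft_next` and `target_next` are hypothetical predictors, where in ConfLayers the draft would be the target model run with the selected layers skipped.

```python
def speculative_generate(prompt, draft_next, target_next,
                         n_tokens=8, draft_len=4):
    """Generate n_tokens after the prompt: the cheap draft model proposes
    short continuations, and the full target model verifies them, keeping
    the longest agreed prefix and substituting its own token at the first
    mismatch."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_tokens:
        # 1. Draft phase: propose draft_len tokens with the cheap model.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. Verify phase: the target model checks each drafted token;
        #    on a mismatch, the target's own token replaces it and we stop.
        accepted = []
        for t in draft:
            verified = target_next(tokens + accepted)
            accepted.append(verified)
            if verified != t:
                break  # reject the rest of the draft
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_tokens]
```

When draft and target agree, whole drafted blocks are accepted at once, which is where the speedup comes from; output quality is preserved because every emitted token is one the target model itself produces or confirms.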