Two-dimensional early exit optimisation of LLM inference

arXiv cs.AI / 4/22/2026


Key Points

  • The paper proposes a two-dimensional early-exit strategy for LLM inference that jointly coordinates layer-wise and sentence-wise stopping for classification tasks.
  • By incrementally processing inputs sentence-by-sentence while progressively activating deeper layers, the method delivers multiplicative compute savings versus optimizing layer-wise or sentence-wise exit alone.
  • Experiments on four SOTA LLMs (3B–8B parameters) and three sentiment datasets show additional 1.4–2.3× speed-ups over the best layer-wise early-exit baselines for simpler tasks, with graceful degradation on harder multi-class settings.
  • The approach is model-agnostic and only needs lightweight classification adapters; it also remains complementary to other efficiency techniques like quantization and pruning.
  • The authors suggest the strategy works best when semantic information accumulates in a predictable way across the input structure, indicating potential beyond sentiment analysis to other sequence-processing tasks.
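The mechanism described above can be sketched as a toy loop: process one sentence at a time, spend progressively more layers of compute as input accumulates, and stop as soon as a lightweight classifier head is confident. This is an illustrative sketch only, not the paper's implementation; the names (`run_layers`, `classify`, `CONF_THRESHOLD`) and the word-count "model" are stand-ins for a real transformer and its adapters.

```python
# Hypothetical sketch of a 2D early-exit loop: sentence-wise outer loop,
# layer-wise inner depth schedule. All names and the toy "model" are
# illustrative assumptions, not taken from the paper.

CONF_THRESHOLD = 0.9  # exit once the adapter head is this confident

def run_layers(state, sentence, n_layers):
    """Stand-in for a partial transformer forward pass: fold one sentence
    into a running state, pretending to spend n_layers of compute."""
    pos = sum(w in {"great", "good", "love"} for w in sentence.split())
    neg = sum(w in {"bad", "awful", "hate"} for w in sentence.split())
    return (state[0] + pos * n_layers, state[1] + neg * n_layers)

def classify(state):
    """Lightweight classification adapter: return (label, confidence)."""
    pos, neg = state
    total = pos + neg
    if total == 0:
        return "neutral", 0.0
    conf = max(pos, neg) / total
    return ("positive" if pos >= neg else "negative"), conf

def two_dim_early_exit(sentences, max_layers=32, layer_step=8):
    """Jointly exit along both dimensions: stop consuming sentences AND
    stop adding layers once the prediction is confident enough."""
    state = (0, 0)
    label = "neutral"
    cost = 0                  # compute spent, in layer-per-sentence units
    n_layers = layer_step     # start shallow, deepen with each sentence
    for sentence in sentences:
        state = run_layers(state, sentence, n_layers)
        cost += n_layers
        label, conf = classify(state)
        if conf >= CONF_THRESHOLD:
            return label, cost  # early exit in both dimensions
        n_layers = min(max_layers, n_layers + layer_step)
    return label, cost  # fell through: processed every sentence
```

If the first sentence is already decisive, the loop exits after spending only `layer_step` units instead of `max_layers` times the sentence count, which is where the multiplicative saving over a purely layer-wise exit comes from in this sketch.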

Abstract

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3× over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.