Two-dimensional early exit optimisation of LLM inference

arXiv cs.AI / 4/22/2026


Key Points

  • The paper proposes a two-dimensional early-exit strategy for LLM inference that jointly coordinates layer-wise and sentence-wise stopping for classification tasks.
  • By incrementally processing inputs sentence-by-sentence while progressively activating deeper layers, the method delivers multiplicative compute savings versus optimizing layer-wise or sentence-wise exit alone.
  • Experiments on four SOTA LLMs (3B–8B parameters) and three sentiment datasets show additional 1.4–2.3× speed-ups over the best layer-wise early-exit baselines for simpler tasks, with graceful degradation on harder multi-class settings.
  • The approach is model-agnostic and only needs lightweight classification adapters; it also remains complementary to other efficiency techniques like quantization and pruning.
  • The authors suggest the strategy works best when semantic information accumulates in a predictable way across the input structure, indicating potential beyond sentiment analysis to other sequence-processing tasks.
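The mechanism described above can be sketched as a toy loop: process one sentence at a time, spend progressively more layers of compute as input accumulates, and stop as soon as a lightweight classifier head is confident. This is an illustrative sketch only, not the paper's implementation; the names (`run_layers`, `classify`, `CONF_THRESHOLD`) and the word-count "model" are stand-ins for a real transformer and its adapters.

```python
# Hypothetical sketch of a 2D early-exit loop: sentence-wise outer loop,
# layer-wise inner depth schedule. All names and the toy "model" are
# illustrative assumptions, not taken from the paper.

CONF_THRESHOLD = 0.9  # exit once the adapter head is this confident

def run_layers(state, sentence, n_layers):
    """Stand-in for a partial transformer forward pass: fold one sentence
    into a running state, pretending to spend n_layers of compute."""
    pos = sum(w in {"great", "good", "love"} for w in sentence.split())
    neg = sum(w in {"bad", "awful", "hate"} for w in sentence.split())
    return (state[0] + pos * n_layers, state[1] + neg * n_layers)

def classify(state):
    """Lightweight classification adapter: return (label, confidence)."""
    pos, neg = state
    total = pos + neg
    if total == 0:
        return "neutral", 0.0
    conf = max(pos, neg) / total
    return ("positive" if pos >= neg else "negative"), conf

def two_dim_early_exit(sentences, max_layers=32, layer_step=8):
    """Jointly exit along both dimensions: stop consuming sentences AND
    stop adding layers once the prediction is confident enough."""
    state = (0, 0)
    label = "neutral"
    cost = 0                  # compute spent, in layer-per-sentence units
    n_layers = layer_step     # start shallow, deepen with each sentence
    for sentence in sentences:
        state = run_layers(state, sentence, n_layers)
        cost += n_layers
        label, conf = classify(state)
        if conf >= CONF_THRESHOLD:
            return label, cost  # early exit in both dimensions
        n_layers = min(max_layers, n_layers + layer_step)
    return label, cost  # fell through: processed every sentence
```

If the first sentence is already decisive, the loop exits after spending only `layer_step` units instead of `max_layers` times the sentence count, which is where the multiplicative saving over a purely layer-wise exit comes from in this sketch.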

Abstract

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3× over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.