AE-LLM: Adaptive Efficiency Optimization for Large Language Models

arXiv cs.LG / 2026/3/24


Key Points

  • The paper proposes AE-LLM, a unified framework that automatically selects and combines LLM efficiency techniques (e.g., efficient attention, MoE, parameter-efficient fine-tuning, and quantization) based on the specific deployment scenario.
  • AE-LLM uses multi-objective optimization to jointly balance accuracy, latency, memory usage, and energy consumption while respecting hardware constraints and task requirements.
  • It introduces an efficient search algorithm to explore the combinatorial space of efficiency configurations across architecture, fine-tuning, and inference stages, producing Pareto-optimal trade-offs.
  • Experiments on 15 models (0.5B–70B) across 10 tasks show an average 2.8× improvement in efficiency metrics while keeping accuracy close to baseline (within 1.2%).
  • The approach also generalizes to vision-language models, delivering similar efficiency gains and positioning the framework as an automated tool for navigating LLM efficiency trade-offs.
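The search sketched in the bullets above — exploring a combinatorial space of efficiency techniques and keeping only Pareto-optimal configurations — can be illustrated with a toy example. Everything here is an assumption for illustration: the technique choices, the `mock_cost_model` numbers, and the brute-force enumeration are stand-ins, not the paper's actual search algorithm or profiling data.

```python
from itertools import product

# Hypothetical configuration space; the paper's real technique sets and
# cost model are not specified here.
ATTENTION = ["dense", "sliding_window"]
QUANT_BITS = [16, 8, 4]
USE_MOE = [False, True]

def mock_cost_model(attn, bits, moe):
    """Toy stand-in for profiling a configuration.
    Returns (accuracy, latency_ms, memory_gb); numbers are invented."""
    acc = 0.80 - 0.01 * (16 // bits - 1) - (0.005 if attn == "sliding_window" else 0.0)
    lat = 100.0 * (bits / 16) * (0.6 if attn == "sliding_window" else 1.0) * (0.7 if moe else 1.0)
    mem = 40.0 * (bits / 16) * (1.3 if moe else 1.0)
    return acc, lat, mem

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (maximize accuracy; minimize latency, memory)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better

def pareto_front(configs):
    """Score every configuration and keep the non-dominated ones."""
    scored = [(c, mock_cost_model(*c)) for c in configs]
    return [(c, s) for c, s in scored
            if not any(dominates(s2, s) for _, s2 in scored if s2 != s)]

space = list(product(ATTENTION, QUANT_BITS, USE_MOE))
for cfg, (acc, lat, mem) in pareto_front(space):
    print(cfg, f"acc={acc:.3f} lat={lat:.1f}ms mem={mem:.1f}GB")
```

The brute-force enumeration works for this 12-point toy space; the paper's contribution is precisely an efficient search that avoids enumerating the full combinatorial space across architecture, fine-tuning, and inference stages.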

Abstract

Large Language Models (LLMs) have achieved remarkable success across diverse applications, yet their deployment remains challenging due to substantial computational costs, memory requirements, and energy consumption. Recent empirical studies have demonstrated that no single efficiency technique is universally optimal; instead, the effectiveness of methods such as efficient attention mechanisms, mixture-of-experts (MoE), parameter-efficient fine-tuning, and quantization varies significantly depending on task characteristics, resource constraints, and model scales. Building upon these insights, we propose AE-LLM, a unified framework that automatically selects and combines optimal efficiency techniques tailored to specific deployment scenarios. Our approach introduces a multi-objective optimization framework that jointly considers accuracy, latency, memory footprint, and energy consumption, while accounting for hardware constraints and task requirements. We develop an efficient search algorithm that explores the combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages, identifying Pareto-optimal configurations. Extensive experiments across 15 models (0.5B–70B parameters) and 10 diverse tasks demonstrate that AE-LLM achieves an average of 2.8× improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline), compared to static efficiency configurations. Furthermore, our framework generalizes effectively to vision-language models, achieving similar efficiency gains. Our contributions provide practitioners with an automated tool for navigating the complex trade-off landscape of LLM efficiency optimization.
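The abstract's final step — accounting for hardware constraints and task requirements when choosing among Pareto-optimal configurations — can be sketched as a constrained weighted selection. This is a minimal sketch under assumed inputs: the candidate tuples, the memory budget, and the weights `w_acc`/`w_lat` are illustrative, not values from the paper.

```python
# Hypothetical deployment-time selection: given already-profiled candidate
# configurations, drop those violating a hardware memory budget, then pick
# the best under a weighted accuracy/latency trade-off.

def pick_config(candidates, mem_budget_gb, w_acc=1.0, w_lat=0.01):
    """candidates: list of (name, accuracy, latency_ms, memory_gb).
    Filters by the memory budget, then maximizes a scalarized score."""
    feasible = [c for c in candidates if c[3] <= mem_budget_gb]
    if not feasible:
        raise ValueError("no configuration fits the hardware budget")
    return max(feasible, key=lambda c: w_acc * c[1] - w_lat * c[2])

# Invented example candidates, e.g. drawn from a precomputed Pareto front.
candidates = [
    ("fp16-dense",  0.800, 100.0, 40.0),
    ("int8-dense",  0.790,  50.0, 20.0),
    ("int4-window", 0.765,  15.0, 10.0),
]
print(pick_config(candidates, mem_budget_gb=24.0))
```

Scalarizing with fixed weights collapses the multi-objective problem into a single score; varying the weights or budget per deployment scenario recovers different points on the trade-off curve, which is the behavior the framework automates.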