Unlocking the Edge: Deployment and On-Device Acceleration of a Multi-LoRA-Enabled One-for-All Foundational LLM

arXiv cs.AI / 4/22/2026

📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research

Key Points

  • The paper proposes a hardware-aware framework to run a LLaMA-based multilingual foundational LLM efficiently on Samsung Galaxy S24/S25 smartphones despite tight memory, latency, and runtime constraints.
  • It uses a single frozen inference graph with application-specific LoRAs provided as runtime inputs, allowing dynamic task switching without recompilation or extra memory overhead.
  • A multi-stream decoding method generates stylistic variants (e.g., formal/polite/jovial) concurrently in one forward pass, cutting latency by up to 6x.
  • For faster token generation, it applies Dynamic Self-Speculative Decoding (DS2D), a tree-based approach that predicts future tokens without a separate draft model, improving decode speed up to 2.3x.
  • With INT4 quantization and additional architecture-level optimizations, the system delivers 4–6x overall gains in memory and latency while preserving accuracy across 9 languages and 8 tasks.
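The first key point — a single frozen graph with task-specific LoRAs fed in as runtime inputs — can be sketched in a few lines. This is an illustrative toy (the function and adapter names are not from the paper): the frozen base weight `W` lives inside the compiled graph, while each application passes its own low-rank `(A, B)` pair as an ordinary input tensor, so switching tasks is just swapping inputs rather than recompiling or duplicating the model.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Compute y = x @ (W + alpha * A @ B) without ever materializing
    the merged weight, so the base graph stays frozen and shared."""
    return x @ W + alpha * ((x @ A) @ B)

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4
W = rng.standard_normal((d_in, d_out))   # frozen base weight, baked into the graph
x = rng.standard_normal((1, d_in))       # one token embedding

# Two hypothetical "applications", each just a pair of small runtime tensors.
adapters = {
    "summarize": (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
    "translate": (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
}

# Dynamic task switching = feeding a different (A, B) to the same graph.
for task, (A, B) in adapters.items():
    y = lora_forward(x, W, A, B)
    print(task, y.shape)
```

The memory argument is visible in the shapes: each adapter adds only `2 * d * rank` parameters per layer (here 512 floats vs. 4096 for the full weight), and no per-task copy of `W` ever exists.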

Abstract

Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with SM8650 and SM8750 Qualcomm chipsets, respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations, such as formal, polite, or jovial responses, within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with INT4 quantization and architecture-level optimizations, our system achieves 4–6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of generative AI on mobile platforms.
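The speedup claimed for DS2D comes from the standard draft-then-verify loop of speculative decoding, run without a separate draft model. The toy below is a sketch of that accept/verify logic only, with stand-in deterministic "models" (nothing here is from the paper; in DS2D the cheap drafter would come from the model itself, e.g. early-exit layers, and drafts form a tree rather than a single chain): a cheap drafter proposes several tokens, the full model checks all of them in what would be one batched pass, and the longest matching prefix plus one corrected token is committed per full-model call.

```python
def full_model_next(token):
    # Stand-in for the full LLM's greedy next-token function.
    return (token * 7 + 3) % 13

def draft_next(token):
    # Stand-in for a cheap self-drafter derived from the same model;
    # here it happens to agree with the full model only on odd tokens.
    return full_model_next(token) if token % 2 else (token + 1) % 13

def speculative_step(last_token, k=4):
    """Draft k tokens, verify them against the full model, and accept
    the longest matching prefix plus one corrected token."""
    draft, t = [], last_token
    for _ in range(k):
        t = draft_next(t)
        draft.append(t)
    accepted, prev = [], last_token
    for d in draft:                      # verification over all draft positions
        target = full_model_next(prev)
        if d == target:
            accepted.append(d)           # draft token confirmed
            prev = d
        else:
            accepted.append(target)      # fix first mismatch, discard the rest
            break
    return accepted

tokens = [5]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens[-1]))
print(tokens)
```

The invariant that makes this lossless is that the output is token-for-token identical to plain greedy decoding with `full_model_next`; the win is that each `speculative_step` commits up to `k` tokens per (batched) full-model verification, which is where the reported up-to-2.3x decode speedup would come from when drafts are mostly accepted.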