FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

arXiv cs.LG / 4/29/2026


Key Points

  • The paper introduces Fed-FSTQ, a Fisher-guided token quantization method aimed at making federated fine-tuning of LLMs practical in edge/mobile settings where uplink bandwidth is limited and clients participate intermittently.
  • It estimates token sensitivity using a lightweight Fisher proxy, then applies importance-aware token selection together with non-uniform mixed-precision quantization to preserve task-critical signals while reducing redundant communication (see the sketch after this list).
  • Fed-FSTQ is model-agnostic and works as a drop-in module for standard federated PEFT pipelines such as LoRA without changing the server aggregation rule.
  • Experiments on multilingual QA and medical QA with non-IID data partitions show large gains, including up to a 46x reduction in cumulative uplink traffic needed to reach the same quality and a 52% improvement in end-to-end time-to-accuracy.
  • When Fisher-guided token reduction is also enabled at inference, it provides up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, supporting deployment under constrained resources.
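
The paper's summary does not include pseudocode, so the following is only a minimal sketch of the general idea under stated assumptions: a diagonal empirical-Fisher proxy (squared log-likelihood gradients) for per-token sensitivity, top-k token selection, and a two-level mixed-precision quantizer. The function names, the keep ratio, and the median split between 8-bit and 4-bit buckets are illustrative assumptions, not Fed-FSTQ's actual design.

```python
import torch


def fisher_token_scores(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token sensitivity via a diagonal empirical-Fisher proxy.

    logits: [seq_len, vocab] with requires_grad=True (e.g., from a forward pass);
    labels: [seq_len]. Returns a [seq_len] sensitivity score per token.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Squared gradient of the summed log-likelihood w.r.t. the logits,
    # reduced over the vocabulary dimension, as a cheap Fisher surrogate.
    grads = torch.autograd.grad(token_ll.sum(), logits)[0]
    return grads.pow(2).sum(dim=-1)


def select_and_quantize(values, scores, keep_ratio=0.25, hi_bits=8, lo_bits=4):
    """Keep the top-`keep_ratio` tokens by score; quantize the more sensitive
    half at `hi_bits` and the rest at `lo_bits` (sub-byte packing omitted)."""
    k = max(1, int(keep_ratio * scores.numel()))
    keep_idx = torch.topk(scores, k).indices
    kept, kept_scores = values[keep_idx], scores[keep_idx]
    hi_mask = kept_scores >= kept_scores.median()  # assumption: median split

    def quantize(x, bits):
        if x.numel() == 0:
            return x.to(torch.int8), torch.tensor(1.0)
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().amax().clamp(min=1e-8) / qmax
        return torch.round(x / scale).clamp(-qmax, qmax).to(torch.int8), scale

    return keep_idx, quantize(kept[hi_mask], hi_bits), quantize(kept[~hi_mask], lo_bits)
```

In a federated round, a client could score the tokens of its local minibatch with `fisher_token_scores` and then call `select_and_quantize` on the corresponding per-token update rows before uploading, keeping the server-side aggregation untouched.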

Abstract

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments training wall-clock time is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
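
The abstract mentions compact sparse message packing for bandwidth-heterogeneous clients but does not specify a wire format. Below is a minimal illustrative sketch of what such packing could look like: a client serializes its selected token indices, low-bit values, and a dequantization scale into a single compressed uplink message, and the server unpacks it before running its unchanged aggregation rule. The function names, field layout, and dtypes are assumptions, not the paper's format.

```python
# Illustrative packing only; not Fed-FSTQ's actual wire format.
import io

import numpy as np


def pack_sparse_update(indices: np.ndarray, q_values: np.ndarray, scale: float) -> bytes:
    """Client side: serialize (token indices, int8 values, fp32 scale) compactly."""
    buf = io.BytesIO()
    np.savez_compressed(
        buf,
        idx=indices.astype(np.uint32),
        q=q_values.astype(np.int8),
        scale=np.float32(scale),
    )
    return buf.getvalue()


def unpack_sparse_update(payload: bytes):
    """Server side: recover indices and dequantized float values, then hand them
    to the existing (unmodified) aggregation rule."""
    with np.load(io.BytesIO(payload)) as data:
        return data["idx"], data["q"].astype(np.float32) * float(data["scale"])
```

Under this kind of scheme, bandwidth-heterogeneous clients could vary the number of kept tokens (and hence the payload size) per round without any change on the server side.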