What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters

arXiv cs.CV / 4/14/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies parameter-efficient fine-tuning of pre-trained image codecs for machine vision, highlighting that adapting the entropy model’s statistical semantics has been comparatively underexplored.
  • It finds that simply inserting adapters into the entropy model can hurt performance, and that adapter choice must be coordinated with where they are placed in the compression pipeline.
  • The proposed Structure-Semantics Co-Tuning (S2-CoT) framework uses two synergistic adapters: a Structural Fidelity Adapter (SFA) in the encoder-decoder to preserve high-fidelity spatial/frequency representations, and a Semantic Context Adapter (SCA) in the entropy model to refine the channel context for better probabilistic coding.
  • Joint optimization of SFA and SCA converts what would be performance degradation into synergistic gains, reaching state-of-the-art results on four base codecs using only a small fraction of trainable parameters and closely matching full fine-tuning.

Abstract

Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at https://github.com/Brock-bit4/S2-CoT.
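To make the parameter-efficiency argument concrete, the generic pattern the paper builds on can be sketched as a bottleneck adapter: a frozen pre-trained transform augmented by a small trainable down-project/up-project residual branch. This is a minimal illustrative sketch of that general pattern, not the paper's SFA or SCA; all names and dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_bottleneck = 256, 16  # adapter rank << feature width (illustrative sizes)

# Pre-trained weight, kept frozen during adapter tuning.
W_frozen = rng.standard_normal((d_model, d_model))

# Trainable adapter branch: down-projection, nonlinearity, up-projection.
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.01
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as the identity residual

def adapted_layer(x):
    """Frozen transform plus a trainable low-rank residual correction."""
    h = x @ W_frozen
    return h + np.maximum(x @ W_down, 0.0) @ W_up  # ReLU bottleneck, residual add

frozen_params = W_frozen.size
adapter_params = W_down.size + W_up.size
print(f"trainable fraction: {adapter_params / (frozen_params + adapter_params):.3%}")
```

Zero-initializing the up-projection means the adapted layer initially reproduces the frozen codec exactly, so tuning starts from the pre-trained behavior; only the bottleneck weights (here roughly 11% of this toy layer, and a far smaller share of a full codec) are updated.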