SpecPL: Disentangling Spectral Granularity for Prompt Learning

arXiv cs.CL / 5/7/2026


Key Points

  • The paper introduces SpecPL, a prompt-learning method for vision-language models (VLMs) that addresses modality asymmetry by explicitly modeling spectral granularity rather than relying on a frozen visual encoder as a holistic feature extractor.
  • SpecPL uses a frozen VAE to split visual signals into semantic low-frequency bands and granular high-frequency details, while a frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants to reduce overfitting.
  • The core training idea is “Counterfactual Granule Supervision,” where high-frequency signals are permuted so the model must separate visual granularity from semantic invariance for finer discrimination.
  • The method is presented as a universal plug-and-play booster that improves existing text-oriented prompt-learning baselines (e.g., CoOp and MaPLe) using visual-side guidance.
  • On 11 benchmarks, SpecPL reports competitive state-of-the-art results, reaching a new best harmonic-mean accuracy of 81.51%, and validates that spectral disentanglement with counterfactual supervision helps balance stability and generalization.
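The counterfactual-granule idea above can be sketched in a few lines. The abstract does not specify how the frozen VAE performs the band split, so the sketch below substitutes a simple FFT low-pass mask as an illustrative stand-in; `split_spectral`, `counterfactual_permute`, and the `cutoff` parameter are hypothetical names, not the paper's API.

```python
import numpy as np

def split_spectral(images, cutoff=0.25):
    """Split a batch of images into low- and high-frequency bands.

    Illustrative stand-in for SpecPL's frozen-VAE decomposition: a
    circular FFT low-pass mask keeps the "semantic" low band, and the
    residual is the "granular" high band.
    images: (B, H, W) float array; cutoff: fraction of the spectrum
    radius treated as low frequency.
    """
    B, H, W = images.shape
    F = np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))
    yy, xx = np.mgrid[:H, :W]
    cy, cx = H // 2, W // 2
    # circular low-pass mask centred on the shifted spectrum
    mask = ((yy - cy) ** 2 + (xx - cx) ** 2) <= (cutoff * min(H, W)) ** 2
    low = np.fft.ifft2(np.fft.ifftshift(F * mask, axes=(-2, -1))).real
    high = images - low  # exact residual, so low + high reconstructs the input
    return low, high

def counterfactual_permute(images, rng, cutoff=0.25):
    """Counterfactual granule supervision (sketch): keep each image's
    low-frequency semantics but graft on high-frequency detail from a
    randomly permuted batch member."""
    low, high = split_spectral(images, cutoff)
    perm = rng.permutation(len(images))
    return low + high[perm], perm
```

Training on such permuted composites is what forces the model to treat high-frequency granularity as separable from the low-frequency semantic content, per the paper's framing.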

Abstract

Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on a frozen visual encoder as a holistic extractor and neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
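For readers unfamiliar with the reported metric: in base-to-novel prompt-learning benchmarks, the headline number is typically the harmonic mean of base-class and novel-class accuracy, which penalizes trading one off against the other. A minimal sketch (the function name is ours, not the paper's):

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base- and novel-class accuracy, the usual
    summary metric for the stability-generalization trade-off in
    base-to-novel prompt-learning evaluations."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)
```

Because the harmonic mean is dominated by the smaller operand, a model cannot reach a high score by excelling on base classes while collapsing on novel ones.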