SpecPL: Disentangling Spectral Granularity for Prompt Learning
arXiv cs.CL / 5/7/2026
Key Points
- The paper introduces SpecPL, a prompt-learning method for vision-language models (VLMs) that addresses modality asymmetry by explicitly modeling spectral granularity rather than relying on a frozen visual encoder as a holistic feature extractor.
- SpecPL uses a frozen VAE to split visual signals into semantic low-frequency bands and granular high-frequency details, while a frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants to reduce overfitting.
- The core training idea is “Counterfactual Granule Supervision,” where high-frequency signals are permuted so the model must separate visual granularity from semantic invariance for finer discrimination.
- The method is presented as a universal plug-and-play booster that improves existing text-oriented prompt-learning baselines (e.g., CoOp and MaPLe) using visual-side guidance.
- Across 11 benchmarks, SpecPL reports a new best harmonic-mean accuracy of 81.51%, supporting the claim that spectral disentanglement combined with counterfactual supervision balances stability and generalization.
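The spectral split and the counterfactual permutation in the points above can be illustrated with a minimal sketch. This is not the paper's implementation (SpecPL uses a frozen VAE, not an FFT); it is a simplified stand-in that shows the core idea: decompose features into low- and high-frequency bands, then shuffle the high-frequency bands across the batch so that semantics can only be recovered from the low-frequency part. All function names here are illustrative.

```python
import numpy as np

def spectral_split(x, cutoff=4):
    """Split a batch of 1-D feature vectors into low- and high-frequency
    bands via the FFT. Keeping the first `cutoff` frequencies stands in
    for SpecPL's "semantic low-frequency" band (simplified assumption)."""
    spec = np.fft.rfft(x, axis=-1)
    low_spec = spec.copy()
    low_spec[..., cutoff:] = 0                 # retain only low frequencies
    high_spec = spec - low_spec                # remaining high-frequency detail
    low = np.fft.irfft(low_spec, n=x.shape[-1], axis=-1)
    high = np.fft.irfft(high_spec, n=x.shape[-1], axis=-1)
    return low, high

def counterfactual_permute(high, rng):
    """Shuffle high-frequency bands across the batch. A model trained on
    low + permuted-high must rely on the low band for semantic decisions,
    mirroring the spirit of Counterfactual Granule Supervision."""
    perm = rng.permutation(len(high))
    return high[perm]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))                   # batch of 8 feature vectors
low, high = spectral_split(x)
x_cf = low + counterfactual_permute(high, rng) # counterfactual views

# By linearity of the FFT, the two bands reconstruct the input exactly.
assert np.allclose(low + high, x)
```

The reconstruction check at the end makes the "disentanglement" concrete: the decomposition loses nothing, so any discriminative signal the model keeps after permutation must live in the low-frequency band.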
