GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

arXiv cs.CV / 4/29/2026


Key Points

  • GramSR is a one-step diffusion-based single-image super-resolution framework designed to improve restoration under real-world, complex degradations by reducing the mismatch between text semantics and spatially aligned visual details.
  • Instead of text conditioning, GramSR conditions the diffusion model on dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder.
  • The method uses a three-stage LoRA training pipeline that sequentially trains a pixel-level module (degradation removal with an L2 loss), a semantic-level module (perceptual detail enhancement with LPIPS and CSD losses), and a texture-level module (texture/feature-correlation consistency with a Gram-matrix loss computed on DINOv3 features).
  • During inference, separate guidance scales provide controllable trade-offs among degradation removal, semantic enhancement, and texture preservation.
  • Experiments on standard super-resolution benchmarks show GramSR outperforms existing one-step diffusion-based approaches, with better structural fidelity and more realistic textures, and the code is released on GitHub.
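The texture-level stage described above matches the classic style-transfer idea of matching Gram matrices, here applied to dense encoder features. The paper's exact normalization and feature shapes are not given in this summary, so the following NumPy sketch is illustrative: it treats features as a (channels × tokens) array, as flattened DINOv3 patch embeddings might be, and penalizes the squared difference between the channel-correlation (Gram) matrices of the super-resolved and ground-truth images.

```python
import numpy as np

def gram_matrix(feats):
    """Normalized Gram matrix of dense features.

    feats: (C, N) array -- C feature channels over N spatial tokens
    (e.g. flattened patch embeddings from a DINOv3 encoder; the
    shape convention and normalization here are assumptions).
    """
    C, N = feats.shape
    return feats @ feats.T / (C * N)  # (C, C) channel correlations

def gram_loss(feats_sr, feats_hr):
    """MSE between Gram matrices of SR-output and HR-reference features."""
    diff = gram_matrix(feats_sr) - gram_matrix(feats_hr)
    return float(np.mean(diff ** 2))
```

Because the Gram matrix discards spatial position and keeps only which feature channels co-activate, matching it encourages the output to reproduce texture statistics rather than pixel-exact layout.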

Abstract

Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, creating a representation gap between abstract semantics and fine-grained visual detail. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using an ℓ2 loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature-correlation consistency through a Gram-matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: https://github.com/aimagelab/GramSR.
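The abstract's "independent guidance scales" suggest a classifier-free-guidance-style combination of per-module predictions. The paper's actual formula is not given here, so the function below is a hypothetical sketch: it combines a base prediction (all LoRA branches off) with the residual contributed by each branch, scaled independently, so each scale tunes one trade-off axis.

```python
import numpy as np

def guided_prediction(base, pix, sem, tex,
                      s_pix=1.0, s_sem=1.0, s_tex=1.0):
    """Hypothetical CFG-style blend of per-LoRA-module predictions.

    base          : prediction with all LoRA branches disabled
    pix, sem, tex : predictions with only the pixel-, semantic-,
                    or texture-level branch enabled
    s_*           : per-branch guidance scales (names and formula
                    are illustrative assumptions, not the paper's)
    """
    return (base
            + s_pix * (pix - base)    # degradation removal strength
            + s_sem * (sem - base)    # semantic/perceptual enhancement
            + s_tex * (tex - base))   # texture preservation
```

Setting a scale to 0 removes that branch's influence entirely, while values above 1 extrapolate its effect, mirroring how guidance scales behave in standard classifier-free guidance.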