RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

arXiv cs.CV / 4/9/2026


Key Points

  • RefineAnything proposes region-specific image refinement as a new problem setting: restore and sharpen fine details only within a user-specified region (a mask or bounding box) while keeping all non-edited regions strictly unchanged.
  • Where existing editing models fail to fully suppress local detail collapse (distorted text, logos, thin structures, etc.), RefineAnything uses a multimodal diffusion-based approach that supports both reference-based and reference-free refinement.
  • Focus-and-Refine builds on the observation that crop-and-resize can improve local reconstruction under the VAE's fixed input resolution, reallocating the resolution budget to the target region to raise both effectiveness and efficiency.
  • A blended-mask paste-back and a Boundary Consistency Loss jointly target strict background preservation and suppression of seam artifacts.
  • The authors construct the training set Refine-30K and the benchmark RefineEval, on which RefineAnything outperforms existing baselines in both edited-region fidelity and background consistency, demonstrating a practical method for high-precision local refinement.
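The Focus-and-Refine idea described above (crop a padded box around the target region, upsample it to the model's fixed input resolution, refine, then paste back only the masked pixels) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: `refine_fn`, `model_res`, `margin`, and the nearest-neighbour resizer are all placeholder assumptions.

```python
import numpy as np

def nn_resize(img, out_h, out_w):
    """Nearest-neighbour resize (a stand-in for a proper interpolator)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def focus_and_refine(image, mask, refine_fn, model_res=256, margin=8):
    """Reallocate the fixed resolution budget to the masked region:
    crop a padded bounding box around the mask, upsample it to the
    refiner's fixed input size, refine, downsample back, and paste
    only masked pixels so the background stays bit-exact."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + 1 + margin, image.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + 1 + margin, image.shape[1])
    crop = image[y0:y1, x0:x1]
    up = nn_resize(crop, model_res, model_res)
    refined_up = refine_fn(up)                       # fixed-resolution refiner (hypothetical)
    refined = nn_resize(refined_up, y1 - y0, x1 - x0)
    # Blended paste-back: pixels outside the mask keep their original values.
    m = mask[y0:y1, x0:x1].astype(np.float32)[..., None]
    out = image.copy()
    out[y0:y1, x0:x1] = (m * refined + (1 - m) * crop).astype(image.dtype)
    return out
```

With a hard binary mask, the blend reduces to copying refined pixels inside the region and original pixels everywhere else, which is what guarantees strict background preservation; a feathered (soft) mask would trade some of that strictness for a smoother seam.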

Abstract

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
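The abstract names a Boundary Consistency Loss for reducing seam artifacts but does not give its form. One plausible reading is an L1 penalty in a thin band straddling the mask edge, pulling the refined output toward the original there so the pasted region meets the untouched background smoothly. The sketch below illustrates that interpretation only; the band construction, width, and L1 choice are all assumptions, not the paper's formulation.

```python
import numpy as np

def _dilate(m, r):
    """Crude binary dilation by shifting (wraps at borders; fine for a sketch)."""
    out = m.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(m, (dy, dx), axis=(0, 1))
    return out

def boundary_consistency_loss(refined, original, mask, band=2):
    """Hypothetical boundary-aware loss: mean L1 distance between the
    refined output and the original image inside a thin band around the
    mask edge (dilated mask minus eroded mask)."""
    inner = ~_dilate(~mask, band)                  # mask eroded by `band`
    band_region = _dilate(mask, band) & ~inner     # thin ring straddling the edge
    if not band_region.any():
        return 0.0
    return float(np.abs(refined - original)[band_region].mean())
```

A gradient-based variant (penalizing intensity jumps across the seam rather than deviation from the original) would serve the same goal; the band-L1 form is just the simplest to state.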