DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis

arXiv cs.CV / 3/26/2026

Key Points

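DepthArb is presented as a training-free framework for a common failure of text-to-image diffusion models: depth-ordered occlusion relationships break down in scenes where multiple objects overlap densely.
DepthArb uses Attention Arbitration Modulation (AAM) to suppress attention for the rear object in overlapping regions, and Spatial Compactness Control (SCC) to curb attention divergence and preserve structural integrity, so that occlusion ambiguity is resolved as a competition between attention maps.
It targets the concept mixing and illogical occlusion caused by the rigid, depth-order-agnostic spatial priors that existing training-free layout-guided methods tend to rely on, and the summary emphasizes that it aims for consistent results without retraining the model.
The authors also propose OcclBench, a benchmark for systematically evaluating occlusion performance, and report that DepthArb outperforms state-of-the-art baselines in both occlusion accuracy and visual quality.
DepthArb is described as a plug-and-play method that enhances the compositional capabilities of diffusion backbones and offers a new perspective on spatial layering in generative models.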
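
The idea of suppressing the rear object's attention inside overlapping regions can be illustrated with a toy NumPy sketch. This is not the paper's formulation, which the summary does not give. The function names, the box format, the depth-order list, and the `suppress` factor are all assumptions made for illustration.

```python
import numpy as np

def box_mask(shape, box):
    """Boolean mask for a layout box (top, left, bottom, right), end-exclusive."""
    mask = np.zeros(shape, dtype=bool)
    top, left, bottom, right = box
    mask[top:bottom, left:right] = True
    return mask

def arbitrate_attention(attn_maps, boxes, depth_order, suppress=0.1):
    """Damp each object's attention where a nearer object's box overlaps its own.

    attn_maps:   dict name -> (H, W) non-negative attention map (hypothetical input)
    boxes:       dict name -> (top, left, bottom, right) layout box
    depth_order: object names listed from nearest to farthest
    suppress:    illustrative multiplier for occluded pixels (an assumption)
    """
    out = {}
    occluder = None  # union of the boxes of all nearer objects seen so far
    for name in depth_order:
        attn = attn_maps[name].astype(float)  # astype returns a copy
        own = box_mask(attn.shape, boxes[name])
        if occluder is None:
            occluder = np.zeros(attn.shape, dtype=bool)
        # Only the region claimed by a nearer object is suppressed.
        attn[occluder & own] *= suppress
        out[name] = attn
        occluder |= own
    return out

# Toy usage: a "cat" in front of a "dog" on an 8x8 attention grid.
attn = {"cat": np.ones((8, 8)), "dog": np.ones((8, 8))}
boxes = {"cat": (0, 0, 5, 5), "dog": (3, 3, 8, 8)}
out = arbitrate_attention(attn, boxes, depth_order=["cat", "dog"])
```

In this toy setup only the four pixels where the two boxes intersect are damped for the rear object, and the front object's map is left unchanged. A real implementation would operate on cross-attention maps inside the diffusion backbone at each denoising step.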

Abstract

Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.
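
The abstract does not say how "attention divergence" is measured. As a minimal sketch under that caveat, two simple diagnostics convey what compactness could mean for a single attention map: how widely the attention is spread around its own centroid, and how much of it leaks outside the object's layout box. Both functions and their names are hypothetical and are not taken from the paper.

```python
import numpy as np

def attention_spread(attn):
    """Attention-weighted mean squared distance from the attention centroid."""
    h, w = attn.shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = attn / attn.sum()                 # normalize to a distribution
    cy, cx = (p * ys).sum(), (p * xs).sum()
    return float((p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum())

def outside_mass(attn, box):
    """Fraction of total attention that falls outside the layout box.

    box is (top, left, bottom, right), end-exclusive (an assumed convention).
    """
    top, left, bottom, right = box
    inside = attn[top:bottom, left:right].sum()
    return float(1.0 - inside / attn.sum())

# Toy usage: a single sharp peak is maximally compact, a flat map is not.
peak = np.zeros((5, 5))
peak[2, 2] = 1.0
flat = np.ones((4, 4))
scores = {
    "peak_spread": attention_spread(peak),
    "flat_spread": attention_spread(flat),
    "flat_leak": outside_mass(flat, (0, 0, 2, 2)),
}
```

Lower values of either score indicate a more compact map. A training-free method could in principle use such a score to steer generation, but whether SCC works this way is not stated in the abstract.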
