MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics

arXiv cs.CV / March 24, 2026

Key Points

  • MS-CustomNet is a diffusion-based text-to-image framework designed for multi-subject customization that preserves individual subject identities while allowing explicit control over how subjects relate and are arranged spatially.
  • The method enables zero-shot integration of multiple user-provided objects and lets users define hierarchical inter-subject compositions and precise placements, rather than relying on implicit, hard-to-control scene layouts (see the sketch after this list).
  • To support training on these complex multi-subject relationships, the authors introduce the MSI dataset, derived from COCO and focused on multi-subject compositional supervision.
  • Reported results show improved control and fidelity, including a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks.
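
To make the kind of control described above concrete, here is a minimal, hypothetical sketch of how a hierarchical multi-subject layout could be specified and flattened into per-subject conditioning tuples. The paper does not publish MS-CustomNet's interface, so every name below (`Subject`, `LayoutNode`, `flatten`) is an illustrative assumption, not the actual API.

```python
# Hypothetical illustration only: MS-CustomNet's real interface is not
# public, so Subject, LayoutNode, and flatten are assumed names.
from dataclasses import dataclass, field

@dataclass
class Subject:
    """One user-provided subject: a reference image plus a target box."""
    name: str
    ref_image_path: str
    bbox: tuple  # (x0, y0, x1, y1) in normalized [0, 1] image coordinates

@dataclass
class LayoutNode:
    """A node in the hierarchical composition: a subject plus the
    subjects arranged relative to it (e.g. 'on top of', 'next to')."""
    subject: Subject
    relation: str = "root"  # relation to the parent node
    children: list = field(default_factory=list)

# A two-level hierarchy: a dog on a sofa, with a ball next to the dog.
sofa = LayoutNode(Subject("sofa", "refs/sofa.jpg", (0.05, 0.45, 0.95, 0.95)))
dog = LayoutNode(Subject("dog", "refs/dog.jpg", (0.30, 0.25, 0.65, 0.70)),
                 relation="on top of")
ball = LayoutNode(Subject("ball", "refs/ball.jpg", (0.68, 0.55, 0.80, 0.68)),
                  relation="next to")
dog.children.append(ball)
sofa.children.append(dog)

def flatten(node, out=None):
    """Walk the hierarchy into flat (name, relation, bbox) tuples,
    the form a conditioning encoder would plausibly consume."""
    out = [] if out is None else out
    out.append((node.subject.name, node.relation, node.subject.bbox))
    for child in node.children:
        flatten(child, out)
    return out

for name, relation, bbox in flatten(sofa):
    print(f"{name:5s} | {relation:10s} | bbox={bbox}")
```

The point of the tree structure is that each subject's placement is stated relative to a parent, which matches the paper's framing of "hierarchical" composition better than a flat list of boxes would.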

Abstract

Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit, user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define hierarchical inter-subject arrangements and spatial placements within the generated image. Our approach preserves individual subject identities while learning and enforcing these user-specified compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet thus offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating superior capability in generating high-fidelity images with precise, user-directed multi-subject composition and spatial control.
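
The abstract says MSI is derived from COCO but gives no construction details. Below is a hedged sketch of how such a multi-subject dataset might be assembled with `pycocotools`; the filtering rules (at least two sizable annotated instances per image, subjects taken from instance boxes) are assumptions for illustration, not the paper's actual pipeline.

```python
# A hedged sketch of assembling COCO-derived multi-subject supervision.
# The thresholds and filtering rules here are assumptions, not MSI's
# published recipe.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")

def multi_subject_samples(min_subjects=2, min_area=64 * 64):
    """Yield (image_info, [(category, bbox), ...]) for images with at
    least `min_subjects` reasonably large annotated instances."""
    for img_id in coco.getImgIds():
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
        subjects = [
            (coco.loadCats(a["category_id"])[0]["name"], a["bbox"])  # bbox = (x, y, w, h)
            for a in anns if a["area"] >= min_area
        ]
        if len(subjects) >= min_subjects:
            yield coco.loadImgs(img_id)[0], subjects

# Each sample pairs one image with its per-subject categories and boxes,
# giving joint supervision for identity and placement.
for info, subjects in multi_subject_samples():
    print(info["file_name"], [(c, [round(v) for v in b]) for c, b in subjects])
    break
```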
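The two reported numbers follow common customization-benchmark practice, so a sketch of how they are typically computed may help readers gauge them. The assumptions: DINO-I is taken as cosine similarity between DINO ViT embeddings of the reference image and a crop of the generated subject (as in DreamBooth-style evaluation), and the positional score as the fraction of subjects whose YOLO detection overlaps the target box. The specific model choices (`dino_vits16`, `yolov8l.pt`) and the IoU threshold are assumptions, not the paper's exact protocol.

```python
# Sketch of DreamBooth-style DINO-I and a YOLO-based positional score.
# Model choices and thresholds are assumptions, not the paper's setup.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from ultralytics import YOLO

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

@torch.no_grad()
def dino_i(ref_path: str, gen_crop_path: str) -> float:
    """Cosine similarity of DINO embeddings (higher = better identity)."""
    embs = [dino(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
            for p in (ref_path, gen_crop_path)]
    return F.cosine_similarity(embs[0], embs[1]).item()

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def positional_score(gen_path: str, targets: dict, iou_thresh: float = 0.5) -> float:
    """Fraction of subjects whose YOLO detection overlaps its target box.

    `targets` maps a COCO class name to its (x0, y0, x1, y1) target box
    in pixel coordinates.
    """
    det = YOLO("yolov8l.pt")(gen_path)[0]
    hits = 0
    for cls_name, target_box in targets.items():
        for box, cls_id in zip(det.boxes.xyxy, det.boxes.cls):
            if det.names[int(cls_id)] == cls_name and \
                    iou(box.tolist(), target_box) >= iou_thresh:
                hits += 1
                break
    return hits / len(targets)
```

On this reading, the reported 0.61 DINO-I would indicate strong but imperfect identity match across subjects, while 0.94 positional accuracy would mean nearly all subjects land in their user-specified boxes.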