
Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

arXiv cs.CV / 3/20/2026


Key Points

  • The paper introduces Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot framework for sim2real image translation that represents realism as structured knowledge via an ontology and knowledge graph.
  • OGD decomposes realism into interpretable traits (e.g., lighting and material properties) and uses a graph neural network to produce a global embedding that conditions a pretrained diffusion model through cross-attention.
  • A symbolic planner translates ontology traits into a sequence of visual edits, enabling structured instruction prompts that guide the diffusion process toward reduced realism gap.
  • Across benchmarks, OGD's graph-based embeddings distinguish real from synthetic images better than baselines, and OGD achieves state-of-the-art performance in sim2real translation while demonstrating data efficiency and interpretability.
  • The work shows that explicitly encoding realism structure can enable generalizable zero-shot sim2real transfer with broader applicability to vision synthesis.
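The graph-to-embedding and cross-attention conditioning steps described above can be sketched in miniature as follows. This is an illustrative NumPy toy, not the paper's implementation: the trait names, adjacency structure, dimensions, and the use of a single shared weight matrix per layer are all assumptions.

```python
import numpy as np

def gnn_global_embedding(trait_feats, adj, W, num_layers=2):
    """Toy message-passing GNN over a trait knowledge graph.

    trait_feats: (n_traits, d) node features (inferred trait activations).
    adj:         (n_traits, n_traits) adjacency matrix of the ontology graph.
    W:           (d, d) shared per-layer weight matrix (illustrative).
    Returns a single (d,) global embedding via mean pooling.
    """
    # Row-normalise adjacency with self-loops so messages are averaged.
    a = adj + np.eye(adj.shape[0])
    a = a / a.sum(axis=1, keepdims=True)
    h = trait_feats
    for _ in range(num_layers):
        h = np.maximum(a @ h @ W, 0.0)  # aggregate neighbours, then ReLU
    return h.mean(axis=0)  # pool node states into one graph embedding

def cross_attention(query, keys, values):
    """Minimal scaled dot-product cross-attention: diffusion feature
    tokens (query) attend to the graph-embedding tokens (keys/values)."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
n_traits, d = 4, 8                      # e.g. lighting, material, shadow, grain
feats = rng.normal(size=(n_traits, d))  # hypothetical trait activations
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(d, d)) * 0.1

g = gnn_global_embedding(feats, adj, W)        # (d,) conditioning vector
q = rng.normal(size=(3, d))                    # 3 diffusion feature tokens
out = cross_attention(q, g[None, :], g[None, :])
print(g.shape, out.shape)                      # (8,) (3, 8)
```

In a real instruction-guided diffusion model the keys/values would be a learned projection of the graph embedding injected at each U-Net cross-attention layer; here a single token stands in for that interface.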

Abstract

Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translation. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.
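The symbolic planner described in the abstract maps activated ontology traits to an ordered sequence of edits, which is then rendered as a structured instruction prompt. A minimal sketch of that idea follows; the trait names, priorities, and edit phrasings are invented for illustration and are not taken from the paper.

```python
# Hypothetical mapping from ontology traits to edit instructions.
# Trait names, priorities, and phrasings are illustrative only.
EDIT_RULES = {
    "flat_lighting":    (0, "add directional lighting with soft shadows"),
    "plastic_material": (1, "increase surface roughness and specular variation"),
    "clean_textures":   (2, "add subtle wear, dust, and texture noise"),
    "missing_grain":    (3, "apply mild sensor noise and film grain"),
}

def plan_instruction_prompt(active_traits):
    """Order the activated traits by priority so the edits form a
    consistent sequence, then join them into one instruction prompt."""
    edits = sorted(EDIT_RULES[t] for t in active_traits if t in EDIT_RULES)
    return "; ".join(text for _, text in edits)

prompt = plan_instruction_prompt({"clean_textures", "flat_lighting"})
print(prompt)
# → "add directional lighting with soft shadows; add subtle wear, dust, and texture noise"
```

A fixed priority order is one simple way to make the edit sequence deterministic regardless of which traits fire; the paper's planner presumably reasons over the knowledge graph rather than a flat lookup table.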