DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

arXiv cs.CV / 4/3/2026


Key Points

  • The paper introduces DOne, an end-to-end design-to-code framework that decouples structure understanding from element rendering to avoid common layout distortions in vision-language approaches.
  • DOne uses a learned layout segmentation module, a hybrid element retriever for UI components with extreme aspect ratios/densities, and a schema-guided generation approach to connect layout representation with code output.
  • To better evaluate high-complexity UIs, the authors introduce HiFi2Code, a benchmark with significantly more layout complexity than prior datasets.
  • Experiments on HiFi2Code show DOne improves both high-level visual similarity (including over 10% in GPT Score) and fine-grained element alignment versus existing methods.
  • Human evaluations report a roughly 3× productivity gain while maintaining higher visual fidelity, indicating practical benefits beyond metric improvements.
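The three-stage pipeline described above can be sketched as a toy program. Everything here is an illustrative stand-in: the function names, the fake header/body segmentation, and the aspect-ratio heuristic for retrieval are all hypothetical, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A layout region produced by the segmentation stage."""
    name: str
    bbox: tuple  # (x, y, w, h) in pixels

def segment_layout(design_size):
    """Stage 1 (stand-in): decompose the design into regions.
    DOne uses a learned segmentation module; here we just fake a header/body split."""
    w, h = design_size
    return [
        Region("header", (0, 0, w, int(h * 0.1))),
        Region("body", (0, int(h * 0.1), w, int(h * 0.9))),
    ]

def retrieve_element(region):
    """Stage 2 (stand-in): map each region to a UI component.
    The paper's hybrid retriever targets extreme aspect ratios and densities;
    this toy version keys off aspect ratio alone."""
    x, y, w, h = region.bbox
    aspect = w / max(h, 1)
    return "nav-bar" if aspect > 4 else "content-panel"

def build_schema(regions):
    """Stage 3 (stand-in): a schema that bridges layout and code generation."""
    return [{"region": r.name, "component": retrieve_element(r), "bbox": r.bbox}
            for r in regions]

def generate_code(schema):
    """Render the schema as HTML, one element per schema entry."""
    return "\n".join(f'<div class="{e["component"]}" data-region="{e["region"]}"></div>'
                     for e in schema)

html = generate_code(build_schema(segment_layout((1280, 800))))
print(html)
```

The point of the decoupling is visible even in this sketch: layout decomposition, element identification, and code emission are separate stages joined only by an explicit schema, so errors in one stage do not silently distort the others.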

Abstract

While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck": a failure to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a 3× productivity gain with higher visual fidelity.