UIPress: Bringing Optical Token Compression to UI-to-Code Generation

arXiv cs.CL / 4/13/2026


Key Points

  • The paper argues that UI-to-code generation needs true visual-token compression because existing approaches mainly filter or zero features without reducing the actual sequence length that drives prefill latency.
  • It proposes UIPress, a lightweight learned compression module placed between a frozen ViT encoder and the Qwen3-VL-8B LLM decoder, designed to compress ~6,700 visual tokens down to a fixed 256-token budget.
  • UIPress uses depthwise-separable convolutions, element-guided spatial reweighting, and a Transformer refinement stage, and pairs this with LoRA on the decoder to bridge the representation gap.
  • Experiments on Design2Code show that UIPress with a 256-token budget reaches a CLIP score of 0.8127 (+7.5% over the uncompressed baseline and +4.6% over the best inference-time baseline) while delivering a 9.1× time-to-first-token speedup.
  • The authors claim UIPress is the first encoder-side learned compression method tailored to the UI-to-code task, enabling better efficiency without sacrificing output quality.
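The three-stage module described in the points above can be sketched, very loosely, as a shape-level pipeline: strided depthwise-separable convolutions shrink the token sequence, a saliency vector stands in for element-guided spatial reweighting, and a single self-attention pass stands in for the Transformer refinement stage. The weights, the 32-dim toy features, and the final adaptive pooling to exactly 256 tokens are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_separable_conv(x, stride=2):
    """Strided depthwise-separable conv over the token axis.
    x: (T, C). Depthwise: a per-channel 3-tap filter; pointwise: a C x C mix.
    Weights are random stand-ins for learned parameters."""
    T, C = x.shape
    dw = rng.standard_normal((3, C)) * 0.1   # depthwise taps (hypothetical)
    pw = rng.standard_normal((C, C)) * 0.1   # pointwise mixing (hypothetical)
    xp = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    y = np.stack([(xp[t:t + 3] * dw).sum(axis=0) for t in range(0, T, stride)])
    return y @ pw                            # (ceil(T / stride), C)

def element_reweight(x, saliency):
    """Scale each token by an element-guided saliency score in [0, 1]."""
    return x * saliency[:, None]

def attention_refine(x):
    """One self-attention pass (identity projections) as a stand-in for the
    paper's Transformer refinement stage."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

tokens = rng.standard_normal((6700, 32))     # toy: ~6,700 tokens, 32-dim
saliency = rng.random(6700)                  # e.g. from a UI-element detector

x = element_reweight(tokens, saliency)
x = depthwise_separable_conv(x)              # 6700 -> 3350
x = depthwise_separable_conv(x)              # 3350 -> 1675
# adaptive mean-pool down to the fixed 256-token budget
x = np.stack([g.mean(axis=0) for g in np.array_split(x, 256)])
compressed = attention_refine(x)             # (256, 32)
```

The key property this illustrates is that, unlike token filtering or attention zeroing, every stage actually shortens the sequence the decoder must prefill.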

Abstract

UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics or zero out low-attention features without actually shortening the sequence; neither truly reduces prefill latency, and neither adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ~6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ~21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering a 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.
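As a quick sanity check on the numbers quoted in the abstract, the token-budget and parameter-budget ratios can be worked out directly. The 8.3B total below is an assumption chosen to be consistent with the quoted 0.26% (the "8B" in the model name is a rounded figure); the measured 9.1× TTFT speedup is smaller than the raw ~26× token reduction because prefill also covers text-prompt tokens and other fixed overheads.

```python
full_tokens = 6700       # visual tokens before compression
budget = 256             # fixed budget after compression
print(f"token reduction: {full_tokens / budget:.1f}x")      # ~26.2x

trainable = 21.7e6       # compression module + LoRA parameters
base_total = 8.3e9       # assumed: Qwen3-VL-8B's exact count is not stated
print(f"trainable fraction: {trainable / base_total:.2%}")  # ~0.26%
```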