The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
arXiv cs.RO / 4/6/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that scaling Vision-Language-Action (VLA) models by improving the vision encoder works for vision-language tasks but can fail for visuomotor action pipelines when actions are represented as discrete tokens.
- It introduces an information-theoretic “Compression Gap” principle: performance scaling is limited by the tightest information bottleneck in the visuomotor pipeline, not by uniformly increasing capacity.
- When actions are continuous (e.g., Diffusion Policy), the vision encoder acts as the binding constraint, so encoder upgrades yield strong gains in manipulation performance.
- When actions are discretized via a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, so encoder improvements cannot meaningfully propagate past that bottleneck.
- Experiments on the LIBERO benchmark provide evidence via (1) an encoder-upgrade factorial study, (2) encoder quality gradients across four encoders, and (3) a codebook-size experiment showing that increasing codebook capacity partially restores sensitivity to encoder improvements.
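The "tightest bottleneck" logic above can be sketched as a toy model. This is an illustrative sketch, not the paper's formulation: the function name and the idea of summarizing each stage by a single bit-capacity number are assumptions for illustration. A codebook with K entries can transmit at most log2(K) bits per action token, so it caps the whole pipeline regardless of encoder quality.

```python
import math

def pipeline_capacity(encoder_bits, codebook_size=None):
    """Toy model of the Compression Gap (illustrative, not from the paper).

    Effective information reaching the action head is capped by the
    tightest stage in the visuomotor pipeline.
    """
    stages = [encoder_bits]
    if codebook_size is not None:
        # A fixed-capacity discrete codebook carries at most log2(K) bits.
        stages.append(math.log2(codebook_size))
    return min(stages)

# Continuous actions (no codebook): encoder upgrades propagate directly.
print(pipeline_capacity(16))        # capacity tracks the encoder
# Discrete actions: a 256-entry codebook caps the pipeline at 8 bits,
# so upgrading the encoder from 16 to 32 bits changes nothing.
print(pipeline_capacity(32, codebook_size=256))
```

Under this toy model, enlarging the codebook (raising log2(K)) is exactly what restores sensitivity to encoder improvements, matching the paper's codebook-size experiment.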