When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

arXiv cs.CL / 5/1/2026


Key Points

  • The paper studies “serialization friction,” where LLMs typically flatten 2D-structured tasks into 1D token sequences, obscuring row/column alignment and local neighborhoods needed for some computations.
  • Using a small suite of synthetic diagnostic tasks (matrix transpose, Conway’s Game of Life, and LU decomposition), the authors compare a text-only pathway against a vision-augmented pathway that preserves the original 2D layout while sharing the same language backbone.
  • Results show the vision-augmented (2D-faithful) pathway consistently outperforms the textual (serialized) pathway across tasks and experimental settings.
  • The performance gap grows with larger dimensions, and model errors under serialization become more spatially structured, highlighting representation-dependent failure modes.
  • The authors conclude that preserving task-relevant 2D structure in the input representation is a promising direction and warrants further investigation.
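The core friction the paper describes can be made concrete with a few lines of code. The sketch below (our own illustration, not the authors' implementation) flattens a matrix row-major, as a tokenizer effectively does, and shows that a 2D-local operation like transpose becomes non-local index arithmetic over the 1D stream, with the required "reach" growing with grid width.

```python
# Illustration of serialization friction (our own sketch, not from the paper):
# row-major flattening turns vertical adjacency into long-range 1D dependence.

def serialize(matrix):
    """Flatten a 2D grid into the 1D sequence a text-only model would see."""
    return [x for row in matrix for x in row]

def transpose_flat(flat, n_rows, n_cols):
    """Transpose expressed over the 1D stream: element (i, j) of the input
    lands at position j * n_rows + i of the output -- global index
    arithmetic, not a local swap."""
    return [flat[i * n_cols + j] for j in range(n_cols) for i in range(n_rows)]

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

flat = serialize(grid)
print(flat)                        # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# In 2D, 1 and 4 are vertical neighbors; in the 1D stream they sit
# n_cols positions apart, a distance that grows with matrix width.
print(transpose_flat(flat, 3, 3))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```

This distance growth with width is one plausible mechanism behind the paper's observation that the text-vs-vision gap widens at larger dimensions.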

Abstract

Large language models (LLMs) conventionally process structured inputs as 1D token sequences. While natural for prose, such linearization may impose additional representational burden on tasks whose computation depends directly on explicit 2D structure, because row–column alignment and local neighborhoods are no longer directly expressed in the input. We study this setting, which we refer to as serialization friction, on a small diagnostic testbed of synthetic tasks with explicit 2D structure: matrix transpose, Conway's Game of Life, and LU decomposition. To probe it, we compare a text-only language pathway over serialized inputs with a vision-augmented pathway, built on the same language backbone, that receives the same underlying content rendered in a task-faithful 2D layout, yielding a system-level comparison between two end-to-end input pathways. Across the tasks and settings we study, the visual pathway consistently outperforms the textual pathway; the gap often widens at larger dimensions, and error patterns under serialization become increasingly spatially structured. These findings indicate that the relationship between input representation and model performance on such tasks warrants further investigation, and suggest that preserving task-relevant 2D layout is a promising direction for structured 2D tasks.
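The "local neighborhoods" point is easiest to see in the Game of Life task. The sketch below (our own, not the paper's code; it uses bounded, non-wrapping boundaries) runs one update on a 2D grid, then shows that the same 8-cell Moore neighborhood, after row-major serialization, becomes eight 1D offsets whose magnitudes depend on the grid width.

```python
# One Game of Life step on a 2D grid (bounded edges, cells outside are dead),
# plus the same neighborhood expressed as offsets in the flattened 1D stream.

def life_step(grid):
    """Synchronous Game of Life update on a list-of-lists grid of 0s and 1s."""
    h, w = len(grid), len(grid[0])

    def live_neighbors(i, j):
        return sum(grid[a][b]
                   for a in (i - 1, i, i + 1) for b in (j - 1, j, j + 1)
                   if (a, b) != (i, j) and 0 <= a < h and 0 <= b < w)

    return [[1 if live_neighbors(i, j) == 3
                  or (live_neighbors(i, j) == 2 and grid[i][j]) else 0
             for j in range(w)] for i in range(h)]

def flat_neighbor_offsets(w):
    """The 8 Moore neighbors of a cell, as offsets into the row-major
    flattening of a width-w grid: they scale with w instead of staying local."""
    return [di * w + dj
            for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]

blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
print(life_step(blinker))        # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(flat_neighbor_offsets(3))  # [-4, -3, -2, -1, 1, 2, 3, 4]
```

In 2D every neighbor is one step away; in the serialized view most neighbors sit roughly a full row-width away, which is consistent with the increasingly spatially structured error patterns the authors report.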