How to Utilize Complementary Vision-Text Information for 2D Structure Understanding
arXiv cs.CL · March 18, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- LLMs typically linearize 2D tables into 1D sequences, weakening row-column adjacency and layout cues.
- Pure visual encoders can capture spatial cues but often struggle to preserve exact cell text.
- The paper shows that vision and text provide highly complementary information to LLMs, but that simple fusion yields limited gains and can introduce cross-modal interference.
- The authors propose DiVA-Former, a lightweight architecture that uses visual tokens as dynamic queries to distill long textual sequences into compact digest vectors, effectively leveraging complementary vision–text information (see the sketch after this list).
- Across 13 table benchmarks, DiVA-Former improves on the pure-text baseline by 23.9% and consistently outperforms baselines that use visual input, textual input, or both.
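The paper's exact DiVA-Former internals are not reproduced here; the following is a minimal PyTorch sketch of the query-based distillation idea the fourth bullet describes, assuming a standard cross-attention block in which visual tokens attend over the linearized text. All names (`DigestDistiller`, the dimensions, the residual-plus-norm layout) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DigestDistiller(nn.Module):
    """Hypothetical cross-attention block: visual tokens act as dynamic
    queries that pool a long textual sequence into a fixed-size set of
    digest vectors, one per visual token."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_vis, d_model) -- the queries
        # text_tokens:   (batch, n_txt, d_model) -- keys/values; n_txt may be
        #                very large for a linearized table
        digest, _ = self.attn(query=visual_tokens,
                              key=text_tokens,
                              value=text_tokens)
        # Residual connection keeps the spatial grounding of the visual
        # tokens while injecting the attended text content.
        return self.norm(visual_tokens + digest)


# Usage: 49 visual patch tokens distill a 2,048-token linearized table
# into 49 digest vectors that could be fed to the LLM with the prompt.
vis = torch.randn(1, 49, 768)
txt = torch.randn(1, 2048, 768)
digests = DigestDistiller()(vis, txt)  # -> shape (1, 49, 768)
```

The appeal of this shape of design is that the text sequence never enters the LLM at full length: however long the linearized table is, the LLM sees only as many digest vectors as there are visual tokens, which is one plausible reading of how a module like this stays "lightweight".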