Can GRPO Boost Complex Multimodal Table Understanding?

arXiv cs.CL / 3/27/2026


Key Points

  • The paper argues that multimodal table understanding is hindered by complex table layouts and demanding logical reasoning: supervised fine-tuning (SFT) is the dominant approach, while reinforcement learning (RL) has struggled with low initial policy accuracy and coarse rewards.
  • It proposes Table-R1, a three-stage reinforcement learning framework that combines a warm-up stage, Perception Alignment GRPO (PA-GRPO) with continuous Tree-Edit-Distance Similarity (TEDS) rewards, and Hint-Completion GRPO (HC-GRPO) using fine-grained, hint-guided residual-step rewards.
  • Experiments on both held-in and held-out datasets show Table-R1 improves table reasoning performance beyond SFT and standard GRPO, indicating that the staged design effectively addresses initialization bottlenecks and reward sparsity.
  • A key result is that Qwen2-VL-7B with Table-R1 surpasses larger table-specific models such as Table-LLaVA 13B, and reaches performance comparable to the closed-source GPT-4o on held-in datasets.
  • Overall, the work suggests GRPO-style RL can be made substantially more effective for multimodal table understanding through tailored reward shaping and multi-phase training.
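The group-relative mechanism that gives GRPO its name can be sketched in a few lines: each question is answered by a group of sampled responses, and every response's advantage is its reward normalized against the group's own mean and standard deviation, with no learned value critic. This is a generic illustration of standard GRPO, not code from the paper.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one group of sampled responses.

    Each response's reward is centered on the group mean and scaled by the
    group standard deviation (plus a small eps to avoid division by zero),
    so the policy update pushes toward better-than-average responses.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# With binary correctness rewards, advantages split cleanly around zero:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

One consequence visible here is the paper's motivation for continuous rewards: if the initial policy almost never produces a correct answer, every group reward is 0, the advantages vanish, and no learning signal flows, which is exactly the cold-start and reward-sparsity problem the warm-up stage and TEDS reward are designed to mitigate.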

Abstract

Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised fine-tuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggled with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) a warm-up stage that elicits initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards on the residual steps of hint-guided questions. Extensive experiments demonstrate that Table-R1 noticeably boosts table reasoning performance on both held-in and held-out datasets, substantially outperforming SFT and vanilla GRPO. Notably, Qwen2-VL-7B with Table-R1 surpasses larger table-specific models (e.g., Table-LLaVA 13B), even achieving performance comparable to the closed-source GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
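The PA-GRPO reward rests on the idea that a partially correct table recognition should earn partial credit. True TEDS computes a tree edit distance over the table's HTML tree; the minimal sketch below substitutes a sequence-similarity ratio over flattened cells as a simplified proxy for that tree comparison, purely to show the shape of a continuous structure-aware reward in [0, 1]. The function names and the flattening scheme are illustrative assumptions, not the paper's implementation.

```python
from difflib import SequenceMatcher


def flatten_table(table):
    """Flatten a table (list of rows of cell strings) into a token
    sequence, inserting a row-boundary token so that row structure,
    not just cell content, influences the similarity score."""
    seq = []
    for row in table:
        seq.extend(row)
        seq.append("<row>")
    return seq


def teds_like_reward(pred_table, gold_table):
    """Continuous similarity in [0, 1] between predicted and gold tables.

    A simplified stand-in for TEDS: instead of tree edit distance over
    the HTML tree, it scores the longest matching subsequences of the
    flattened cell streams via difflib's ratio.
    """
    a = flatten_table(pred_table)
    b = flatten_table(gold_table)
    return SequenceMatcher(None, a, b).ratio()
```

A perfect reconstruction scores 1.0, while a table with one wrong cell still earns most of the reward, so a weak initial policy receives a gradient signal long before it can reproduce tables exactly, which is the property that makes this reward denser than binary exact-match.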