Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

arXiv cs.AI / 4/6/2026


Key Points

  • The paper introduces Chart-RL, a reinforcement learning framework designed to improve vision-language model performance on chart question answering by strengthening both visual perception and logical inference.
  • It targets key CQA failures in existing VLMs, including inaccurate numerical extraction, misreading implicit relationships in charts, and weak attention to spatial structure.
  • Chart-RL uses feedback-driven policy optimization with adaptive reward functions, and the authors report better results than baseline foundation models and competitive performance versus larger state-of-the-art systems.
  • Using RL plus parameter-efficient fine-tuning via LoRA, the method can run with a single-GPU setup while maintaining performance, and it benchmarks across multiple model families on the ChartQAPro dataset.
  • A highlighted result is RL fine-tuning Qwen3-VL-4B-Instruct to 0.634 answer accuracy (vs. 0.580 for the 8B foundation model) while cutting inference latency from 31s to 9s.

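The key points above mention feedback-driven policy optimization with adaptive reward functions, but the paper's reward code is not reproduced here. As a minimal illustrative sketch only: the snippet below assumes a reward that gives full credit for exact-match answers, partial credit for numeric answers within a relative tolerance, and a GRPO-style group-normalized advantage. The function names, the 5% tolerance, and the GRPO-style normalization are assumptions, not details from the paper.

```python
def chart_qa_reward(prediction: str, reference: str, rel_tol: float = 0.05) -> float:
    """Toy CQA reward: exact match for text answers, relative-tolerance
    match for numeric answers (assumed scheme, not from the paper)."""
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    if pred == ref:
        return 1.0
    try:
        p, r = float(pred.rstrip("%")), float(ref.rstrip("%"))
    except ValueError:
        return 0.0  # non-numeric mismatch gets no credit
    if r == 0.0:
        return 1.0 if p == 0.0 else 0.0
    # Partial credit when the extracted number is within rel_tol of the reference.
    return 0.5 if abs(p - r) / abs(r) <= rel_tol else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: reward minus group mean, divided by group std
    (a common policy-optimization recipe; assumed here, not confirmed by the paper)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```

In such a setup, several candidate answers are sampled per chart question, scored with the reward, and the normalized advantages weight the policy-gradient update.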
Abstract

Recent advancements in Vision Language Models (VLMs) have demonstrated progress toward robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs' chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation is a comprehensive framework integrating reinforcement learning (RL) via policy optimization with adaptive reward functions, which demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrate Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) into the RL framework, which requires only a single-GPU configuration while preserving performance. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models using the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite using half the parameter count, while reducing inference latency from 31 seconds to 9 seconds.
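The abstract credits LoRA with making single-GPU RL fine-tuning feasible. As a back-of-the-envelope illustration (my own arithmetic, not figures from the paper): a rank-r adapter on a d_out × d_in weight matrix trains only r·(d_in + d_out) parameters instead of d_in·d_out, since the frozen weight W is augmented by a low-rank product B @ A.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA adapter W + B @ A,
    where A is (rank x d_in) and B is (d_out x rank)."""
    return rank * d_in + d_out * rank

# Hypothetical example: a 4096 x 4096 projection with rank-16 adapters.
full = 4096 * 4096                              # 16,777,216 params if fully fine-tuned
lora = lora_trainable_params(4096, 4096, 16)    # 131,072 params, under 1% of full
```

This reduction in trainable parameters (and hence optimizer state and gradient memory) is what typically allows RL fine-tuning of a 4B-parameter model to fit on a single GPU.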