Evaluating Prompting Strategies for Chart Question Answering with Large Language Models

arXiv cs.AI / 3/25/2026


Key Points

  • The study systematically compares four prompting strategies (Zero-Shot, Few-Shot, and their Chain-of-Thought variants) for chart question answering with large language models, using structured chart data alone as input.
  • Across GPT-3.5, GPT-4, and GPT-4o evaluated on 1,200 ChartQA samples, Few-Shot Chain-of-Thought achieves the best overall results, reaching up to 78.2% accuracy, especially for reasoning-heavy questions.
  • Few-Shot prompting (without Chain-of-Thought) is shown to improve output format adherence, indicating a tradeoff between reasoning quality and response structure consistency.
  • Zero-Shot prompting tends to work well only for higher-capacity models and primarily on simpler tasks, suggesting that prompting design is crucial for structured-data reasoning.
  • The authors provide practical guidance for choosing prompting methods in real-world structured chart reasoning systems, balancing efficiency and accuracy.
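The four prompting paradigms above differ only in whether demonstrations and a reasoning cue are included. As a rough sketch of how such prompts might be assembled from structured chart data (the chart, the example Q/A pair, and the wording are hypothetical, not the authors' actual templates):

```python
# Illustrative sketch of the four prompting paradigms. The chart data,
# few-shot demonstration, and cue wording are hypothetical examples,
# not the prompts used in the study.

CHART = "Year,Sales\n2021,120\n2022,150\n2023,180"  # structured chart input

EXAMPLES = [  # hypothetical few-shot demonstration(s)
    ("Month,Rain\nJan,30\nFeb,45", "Which month had more rain?", "Feb"),
]

COT_CUE = "Let's think step by step."

def build_prompt(question: str, few_shot: bool = False, cot: bool = False) -> str:
    """Assemble one of the four prompt variants:
    Zero-Shot, Few-Shot, Zero-Shot CoT, or Few-Shot CoT."""
    parts = []
    if few_shot:
        for chart, q, a in EXAMPLES:
            demo = f"Chart: {chart}\nQ: {q}\n"
            # In Few-Shot CoT the demonstrations would carry worked
            # reasoning; here we only prepend the cue for brevity.
            demo += (COT_CUE + " ") if cot else ""
            demo += f"A: {a}"
            parts.append(demo)
    query = f"Chart: {CHART}\nQ: {question}\n"
    if cot:
        query += COT_CUE + "\n"
    query += "A:"
    parts.append(query)
    return "\n\n".join(parts)
```

For example, `build_prompt("Which year had the highest sales?", few_shot=True, cot=True)` yields a Few-Shot Chain-of-Thought prompt, while the default arguments yield plain Zero-Shot.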

Abstract

Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured-data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.
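The two metrics can be sketched as follows. The 5% numeric tolerance below is an assumption borrowed from the common ChartQA-style "relaxed accuracy" convention; the abstract itself only names the metrics Accuracy and Exact Match.

```python
# Sketch of the two evaluation metrics. The 5% numeric tolerance for
# Accuracy is an assumed ChartQA-style relaxation, not confirmed by
# the paper; Exact Match requires identical normalized strings.

def exact_match(pred: str, gold: str) -> bool:
    # Exact Match: case- and whitespace-normalized strings must be equal.
    return pred.strip().lower() == gold.strip().lower()

def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    # Numeric answers may deviate by up to `tol` (assumed 5%);
    # non-numeric answers fall back to exact match.
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return exact_match(pred, gold)
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol

def score(preds, golds):
    # Aggregate both metrics over a batch of predictions.
    n = len(golds)
    acc = sum(relaxed_accuracy(p, g) for p, g in zip(preds, golds)) / n
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    return acc, em
```

Under this convention a prediction of "102" against a gold answer of "100" counts toward Accuracy but not Exact Match, which is one way format adherence and answer correctness can diverge.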