Evaluating Prompting Strategies for Chart Question Answering with Large Language Models

arXiv cs.AI / 3/25/2026


Key Points

  • The study systematically compares four prompting strategies (Zero-Shot, Few-Shot, and their Chain-of-Thought variants) for chart question answering with large language models, using structured chart data alone as input.
  • Across GPT-3.5, GPT-4, and GPT-4o evaluated on 1,200 ChartQA samples, Few-Shot Chain-of-Thought achieves the best overall results, reaching up to 78.2% accuracy, especially for reasoning-heavy questions.
  • Few-Shot prompting (without Chain-of-Thought) is shown to improve output format adherence, indicating a tradeoff between reasoning quality and response structure consistency.
  • Zero-Shot prompting tends to work well only for higher-capacity models and primarily on simpler tasks, suggesting that prompting design is crucial for structured-data reasoning.
  • The authors provide practical guidance for choosing prompting methods in real-world structured chart reasoning systems, balancing efficiency and accuracy.
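The four prompting paradigms above differ only in whether demonstrations and a reasoning cue are included. As a rough sketch of how such prompts might be assembled from structured chart data (the chart, the example Q/A pair, and the wording are hypothetical, not the authors' actual templates):

```python
# Illustrative sketch of the four prompting paradigms. The chart data,
# few-shot demonstration, and cue wording are hypothetical examples,
# not the prompts used in the study.

CHART = "Year,Sales\n2021,120\n2022,150\n2023,180"  # structured chart input

EXAMPLES = [  # hypothetical few-shot demonstration(s)
    ("Month,Rain\nJan,30\nFeb,45", "Which month had more rain?", "Feb"),
]

COT_CUE = "Let's think step by step."

def build_prompt(question: str, few_shot: bool = False, cot: bool = False) -> str:
    """Assemble one of the four prompt variants:
    Zero-Shot, Few-Shot, Zero-Shot CoT, or Few-Shot CoT."""
    parts = []
    if few_shot:
        for chart, q, a in EXAMPLES:
            demo = f"Chart: {chart}\nQ: {q}\n"
            # In Few-Shot CoT the demonstrations would carry worked
            # reasoning; here we only prepend the cue for brevity.
            demo += (COT_CUE + " ") if cot else ""
            demo += f"A: {a}"
            parts.append(demo)
    query = f"Chart: {CHART}\nQ: {question}\n"
    if cot:
        query += COT_CUE + "\n"
    query += "A:"
    parts.append(query)
    return "\n\n".join(parts)
```

For example, `build_prompt("Which year had the highest sales?", few_shot=True, cot=True)` yields a Few-Shot Chain-of-Thought prompt, while the default arguments yield plain Zero-Shot.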

Abstract

Prompting strategies affect LLM reasoning performance, but their role in chart-based QA remains underexplored. We present a systematic evaluation of four widely used prompting paradigms (Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought) across GPT-3.5, GPT-4, and GPT-4o on the ChartQA dataset. Our framework operates exclusively on structured chart data, isolating prompt structure as the only experimental variable, and evaluates performance using two metrics: Accuracy and Exact Match. Results from 1,200 diverse ChartQA samples show that Few-Shot Chain-of-Thought prompting consistently yields the highest accuracy (up to 78.2%), particularly on reasoning-intensive questions, while Few-Shot prompting improves format adherence. Zero-Shot performs well only with high-capacity models on simpler tasks. These findings provide actionable guidance for selecting prompting strategies in structured-data reasoning tasks, with implications for both efficiency and accuracy in real-world applications.
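The two metrics can be sketched as follows. The 5% numeric tolerance below is an assumption borrowed from the common ChartQA-style "relaxed accuracy" convention; the abstract itself only names the metrics Accuracy and Exact Match.

```python
# Sketch of the two evaluation metrics. The 5% numeric tolerance for
# Accuracy is an assumed ChartQA-style relaxation, not confirmed by
# the paper; Exact Match requires identical normalized strings.

def exact_match(pred: str, gold: str) -> bool:
    # Exact Match: case- and whitespace-normalized strings must be equal.
    return pred.strip().lower() == gold.strip().lower()

def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    # Numeric answers may deviate by up to `tol` (assumed 5%);
    # non-numeric answers fall back to exact match.
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return exact_match(pred, gold)
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol

def score(preds, golds):
    # Aggregate both metrics over a batch of predictions.
    n = len(golds)
    acc = sum(relaxed_accuracy(p, g) for p, g in zip(preds, golds)) / n
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / n
    return acc, em
```

Under this convention a prediction of "102" against a gold answer of "100" counts toward Accuracy but not Exact Match, which is one way format adherence and answer correctness can diverge.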