Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?
arXiv cs.CL · April 29, 2026
Key Points
- The study examines whether changing the way questions are phrased (multiple-choice, true/false, short/long answers) affects LLM accuracy on reasoning tasks.
- Across five LLMs and two evaluation dimensions (reasoning-step accuracy and final-answer selection accuracy), performance varies significantly by question type.
- Reasoning-step accuracy does not always predict how accurately the model picks the final answer, indicating a potential mismatch between intermediate reasoning and outcome selection.
- The number of answer options and specific wording in the questions can meaningfully influence LLM performance.
- Overall, the paper highlights that results on reasoning benchmarks may depend heavily on prompt and question formatting, not on model reasoning capability alone.
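To make the comparison concrete, here is a minimal sketch of rendering the same question under the answer formats the study contrasts (multiple-choice, true/false, and open-ended). The function name, prompt wording, and option labels are illustrative assumptions, not taken from the paper.

```python
# Hypothetical helper: render one question in several answer formats,
# so the same underlying item can be posed to an LLM in different ways.
# All wording below is illustrative, not the paper's actual prompts.

def format_question(question, options=None, style="open"):
    """Return the question text rendered in the requested style."""
    if style == "multiple_choice":
        if not options:
            raise ValueError("multiple-choice style requires answer options")
        labels = [chr(ord("A") + i) for i in range(len(options))]
        lines = [question]
        lines += [f"{label}. {option}" for label, option in zip(labels, options)]
        lines.append("Answer with the letter of the correct option.")
        return "\n".join(lines)
    if style == "true_false":
        return f"True or false: {question}"
    # Default: open-ended short answer.
    return f"{question}\nAnswer in one short sentence."


# Example usage: the same item phrased three ways.
print(format_question("The sum of two odd integers is always even.",
                      style="true_false"))
print(format_question("Which of these numbers is prime?",
                      options=["9", "15", "7", "21"],
                      style="multiple_choice"))
print(format_question("Which of these numbers is prime: 9, 15, 7, or 21?"))
```

Varying `options` length here also mirrors the paper's finding that the number of answer choices can shift measured accuracy, since each rendering changes what the model must select among.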