Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

arXiv cs.CL / April 17, 2026


Key Points

  • The paper argues that traditional multiple-choice LLM benchmarks with a small number of options can be inflated by shortcut strategies, masking a model’s true competence.
  • It proposes a “massive option evaluation” protocol that expands the candidate set to up to 100 choices to reduce chance-level effects and produce more stable, reliable accuracy estimates.
  • Applied to Korean orthography error detection, the method helps disentangle genuine content-related failures from artifacts such as positional bias by using repeated resampling and shuffling.
  • Experiments show that models that appear strong at low option counts often lose their advantage when the distractor set becomes dense, indicating capability gaps that conventional benchmarks may hide.
  • The study identifies two key failure modes—semantic confusion and a bias toward early options under uncertainty—and uses padding- and length-matched tests to suggest that candidate ranking, not context length, is the main bottleneck.
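The resample-and-shuffle loop described in the key points can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's released code: the function name `evaluate_massive_options` and the `pick_fn` callable (standing in for an LLM's choice) are hypothetical.

```python
import random
from collections import Counter

def evaluate_massive_options(target, distractor_pool, pick_fn,
                             n_options=100, trials=50, seed=0):
    """Massive-option evaluation sketch: fixed target, resampled distractors.

    target: the single incorrect sentence the model must find.
    distractor_pool: correct sentences to draw distractors from.
    pick_fn: callable(options) -> chosen index (stands in for the model).
    Returns (accuracy, Counter of chosen option positions).
    """
    rng = random.Random(seed)
    hits = 0
    positions = Counter()
    for _ in range(trials):
        # Resample a fresh distractor set and shuffle the option order,
        # so position effects average out across trials.
        options = rng.sample(distractor_pool, n_options - 1) + [target]
        rng.shuffle(options)
        choice = pick_fn(options)
        positions[choice] += 1
        if options[choice] == target:
            hits += 1
    return hits / trials, positions
```

Because the target is fixed while distractors and positions vary, the accuracy estimate isolates content-driven errors, and the position histogram exposes artifacts such as always picking early options.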

Abstract

Multiple choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. We therefore propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content-driven failures from positional artifacts. Across experiments, results indicate that strong performance in low-option settings can overstate model competence. This apparent advantage often weakens under dense interference at high N, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes: semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding-controlled and length-matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low-option benchmarks can reveal.
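One way to flag the positional artifacts the abstract mentions is to test whether the model's chosen positions are uniform across shuffled trials. A minimal sketch, assuming a dict of position counts collected over repeatedly shuffled trials (this is an illustrative chi-square-style check, not the paper's exact analysis):

```python
def position_bias_statistic(position_counts, n_options, trials):
    """Chi-square statistic for uniformity of chosen option positions.

    Under repeated shuffling, an unbiased chooser selects each position
    equally often in expectation; a large statistic signals positional
    artifacts such as early-option bias under uncertainty.
    """
    expected = trials / n_options  # uniform null: trials spread evenly
    return sum((position_counts.get(i, 0) - expected) ** 2 / expected
               for i in range(n_options))
```

For example, a chooser that always answers with the first option concentrates all counts at position 0 and yields a large statistic, while a perfectly uniform histogram yields 0.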