Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text
arXiv cs.AI / 5/5/2026
📰 News · Models & Research
Key Points
- The study evaluates five frontier LLMs that claim 1M-token context windows on classical Chinese texts, focusing on long-context retrieval and reasoning.
- Single-needle retrieval at 1M tokens is essentially solved for the strongest models—Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 all reach 100% accuracy.
- Multi-hop reasoning shows distinct “decay signatures” as context grows (see the sketch after this list): Gemini and Claude stay strong, holding above 80% through 512K with only modest degradation at 1M; GPT-5.5 and Qwen3.6-plus drop sharply between 512K and 1M; and DeepSeek V4 Pro declines smoothly across the whole range.
- The authors conclude that advertised context-window size is a weak predictor of usable long-context multi-hop performance, and that the 512K-to-1M transition is the strongest discriminator among models claiming 1M contexts.
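The findings above rest on a needle-in-a-haystack style protocol: plant facts in a long distractor corpus, scale the context length, and measure answer accuracy. Below is a minimal sketch of how a chained multi-hop probe of this kind might be structured; `client.generate`, the fact sentences, and the character-based length heuristic are all illustrative assumptions, not the paper's actual harness.

```python
# Minimal sketch of a multi-hop long-context probe. Everything here is
# illustrative: `client.generate`, the needle wording, and the crude
# length control are assumptions, not the paper's harness.
import random

def insert_facts(haystack: str, facts: list[str]) -> str:
    """Scatter fact sentences at random depths in the distractor text."""
    for fact in facts:
        pos = random.randint(0, len(haystack))
        haystack = haystack[:pos] + "\n" + fact + "\n" + haystack[pos:]
    return haystack

def multi_hop_probe(client, model: str, corpus: str, target_len: int) -> bool:
    """One two-hop probe: answering requires combining both planted facts."""
    haystack = corpus[:target_len]  # crude length control; real harnesses count tokens
    facts = [
        "The hidden scroll is stored in the Azure Pavilion.",   # hop 1 (hypothetical)
        "The Azure Pavilion lies on the eastern riverbank.",    # hop 2 (hypothetical)
    ]
    prompt = insert_facts(haystack, facts) + "\n\nWhere is the hidden scroll located?"
    answer = client.generate(model=model, prompt=prompt)
    return "eastern riverbank" in answer.lower()

def decay_curve(client, model: str, corpus: str, trials: int = 20) -> dict[int, float]:
    """Accuracy at each context length; the shape of this curve is the
    'decay signature' described in the key points."""
    lengths = [128_000, 256_000, 512_000, 1_000_000]
    return {
        n: sum(multi_hop_probe(client, model, corpus, n) for _ in range(trials)) / trials
        for n in lengths
    }
```

Plotting the accuracies returned by `decay_curve` against context length would reproduce the kind of per-model decay shapes the paper compares: flat-then-modest-drop, sharp cliff between 512K and 1M, or smooth decline.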