Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

arXiv cs.AI / 5/5/2026

📰 News · Models & Research

Key Points

  • The study evaluates five frontier LLMs with advertised 1M-token context windows on a classical Chinese corpus, focusing on long-context retrieval and reasoning.
  • Single-needle retrieval at 1M tokens is essentially solved for the strongest models—Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 all reach 100% accuracy.
  • Multi-hop reasoning shows three distinct “decay signatures” as context grows: Gemini 3.1 Pro and Claude Opus 4.7 stay strong (>80% through 512K with only modest degradation at 1M), GPT-5.5 and Qwen3.6-plus drop sharply between 512K and 1M, and DeepSeek V4 Pro declines smoothly across the whole range.
  • The authors conclude that advertised context-window size is a weak predictor of usable long-context multi-hop performance, with the 512K-to-1M transition being the strongest discriminator among 1M-context models.

Abstract

We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100%), but that multi-hop performance reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) maintaining greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) collapsing sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) decaying gradually across the entire range. The findings suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
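To make the two protocols concrete, the sketch below shows one way such a harness could be wired up in Python. Everything specific here is an assumption rather than the authors' code: the fractional depths (10%/50%/90%), the placeholder needle and chain contents, and the helper names (plant_needles, build_three_hop_context, score_variant_pair, and the ask model wrapper) are invented for illustration; the paper's actual corpus handling, prompts, and scoring rules are not given in this summary.

```python
"""Minimal sketch of the paper's two evaluation protocols.

All depths, names, and placeholder contents are illustrative assumptions,
not the authors' implementation.
"""
import random

TIERS = {"256K": 256_000, "512K": 512_000, "1M": 1_000_000}  # context budgets in tokens


def plant_needles(filler_tokens, needles, depths=(0.1, 0.5, 0.9)):
    """Test 1: insert one short 'needle' fact at each fractional depth.

    `filler_tokens` stands in for the tokenized classical Chinese corpus;
    each needle is a short tokenized biographical statement. Positions are
    computed on the original stream, then inserted deepest-first so the
    shallower offsets stay valid.
    """
    tokens = list(filler_tokens)
    n = len(tokens)
    for needle, depth in sorted(zip(needles, depths), key=lambda p: p[1], reverse=True):
        pos = int(n * depth)
        tokens[pos:pos] = list(needle)
    return tokens


def build_three_hop_context(filler_tokens, chain, rng=random.Random(0)):
    """Test 2: scatter the three links of a reasoning chain (conceptually
    A->B, B->C, C->answer) at random positions in the filler, so that
    answering a question about A requires traversing all three hops
    in context.
    """
    tokens = list(filler_tokens)
    positions = sorted(rng.sample(range(len(tokens)), len(chain)), reverse=True)
    for fact, pos in zip(chain, positions):
        tokens[pos:pos] = list(fact)
    return tokens


def score_variant_pair(ask, question, real_answer, altered_answer):
    """Separate genuine retrieval from memorisation: when the *altered*
    (training-prior-contradicting) needle was planted, echoing the
    real-world answer means the model relied on memorised training data
    rather than the context. `ask` is a hypothetical model-call wrapper.
    """
    reply = ask(question)
    if altered_answer in reply:
        return "in-context retrieval"
    if real_answer in reply:
        return "training-prior leakage"
    return "miss"


if __name__ == "__main__":
    filler = ["甲"] * TIERS["256K"]                    # stand-in filler tokens
    needles = [["NEEDLE", str(i)] for i in range(3)]
    ctx = plant_needles(filler, needles)
    print([i for i, t in enumerate(ctx) if t == "NEEDLE"])  # needle offsets
```

Read against this setup, the three decay signatures describe how three-hop accuracy behaves as the filler budget moves through the tiers above; computing needle positions against the original stream before inserting keeps the three depths independent of needle length.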