Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Text

arXiv cs.AI / 5/5/2026

📰 News · Models & Research

Key Points

  • The study evaluates five frontier LLMs with advertised 1M-token context windows on a classical Chinese corpus, focusing on long-context retrieval and reasoning.
  • Single-needle retrieval at 1M tokens is essentially solved for the strongest models—Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 all reach 100% accuracy.
  • Multi-hop reasoning shows three distinct “decay signatures” as context grows: Gemini 3.1 Pro and Claude Opus 4.7 stay strong (>80% through 512K with only modest degradation at 1M), GPT-5.5 and Qwen3.6-plus drop sharply between 512K and 1M, and DeepSeek V4 Pro declines smoothly across the whole range.
  • The authors conclude that advertised context-window size is a weak predictor of usable long-context multi-hop performance, with the 512K-to-1M transition being the strongest discriminator among 1M-context models.

Abstract

We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures single-needle retrieval at 1M tokens of input, with three biographical needles planted at three depths and pairs of real (training-prior-consistent) and altered (training-prior-contradicting) variants to separate genuine in-context retrieval from reliance on memorised training data. Test 2, a follow-up designed to probe whether long-context capability degrades when retrieval requires intermediate reasoning, measures three-hop chain traversal across three context tiers (256K, 512K, and 1M tokens). We find that single-needle retrieval at 1M is essentially solved for the strongest models (Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 each achieve 100%), but that multi-hop performance reveals three distinct decay signatures: a stable regime (Gemini Pro, Claude) maintaining greater than 80% accuracy through 512K with modest degradation at 1M; a late-cliff regime (GPT-5.5, Qwen3.6-plus) collapsing sharply between 512K and 1M; and a smooth-decline regime (DeepSeek V4 Pro) decaying gradually across the entire range. The findings suggest that nominal context-window length is a poor proxy for usable long-context multi-hop capability, and that the sharpest discriminator between current 1M-context flagships is the 512K-to-1M transition.
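To make the two protocols concrete, the sketch below shows one way such a harness could be wired up in Python. Everything specific here is an assumption rather than the authors' code: the fractional depths (10%/50%/90%), the placeholder needle and chain contents, and the helper names (plant_needles, build_three_hop_context, score_variant_pair, and the ask model wrapper) are invented for illustration; the paper's actual corpus handling, prompts, and scoring rules are not given in this summary.

```python
"""Minimal sketch of the paper's two evaluation protocols.

All depths, names, and placeholder contents are illustrative assumptions,
not the authors' implementation.
"""
import random

TIERS = {"256K": 256_000, "512K": 512_000, "1M": 1_000_000}  # context budgets in tokens


def plant_needles(filler_tokens, needles, depths=(0.1, 0.5, 0.9)):
    """Test 1: insert one short 'needle' fact at each fractional depth.

    `filler_tokens` stands in for the tokenized classical Chinese corpus;
    each needle is a short tokenized biographical statement. Positions are
    computed on the original stream, then inserted deepest-first so the
    shallower offsets stay valid.
    """
    tokens = list(filler_tokens)
    n = len(tokens)
    for needle, depth in sorted(zip(needles, depths), key=lambda p: p[1], reverse=True):
        pos = int(n * depth)
        tokens[pos:pos] = list(needle)
    return tokens


def build_three_hop_context(filler_tokens, chain, rng=random.Random(0)):
    """Test 2: scatter the three links of a reasoning chain (conceptually
    A->B, B->C, C->answer) at random positions in the filler, so that
    answering a question about A requires traversing all three hops
    in context.
    """
    tokens = list(filler_tokens)
    positions = sorted(rng.sample(range(len(tokens)), len(chain)), reverse=True)
    for fact, pos in zip(chain, positions):
        tokens[pos:pos] = list(fact)
    return tokens


def score_variant_pair(ask, question, real_answer, altered_answer):
    """Separate genuine retrieval from memorisation: when the *altered*
    (training-prior-contradicting) needle was planted, echoing the
    real-world answer means the model relied on memorised training data
    rather than the context. `ask` is a hypothetical model-call wrapper.
    """
    reply = ask(question)
    if altered_answer in reply:
        return "in-context retrieval"
    if real_answer in reply:
        return "training-prior leakage"
    return "miss"


if __name__ == "__main__":
    filler = ["甲"] * TIERS["256K"]                    # stand-in filler tokens
    needles = [["NEEDLE", str(i)] for i in range(3)]
    ctx = plant_needles(filler, needles)
    print([i for i, t in enumerate(ctx) if t == "NEEDLE"])  # needle offsets
```

Read against this setup, the three decay signatures describe how three-hop accuracy behaves as the filler budget moves through the tiers above; computing needle positions against the original stream before inserting keeps the three depths independent of needle length.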