100万トークンのコンテキストウィンドウを総当たりで検証するのは正しいアプローチか？

Reddit r/LocalLLaMA / 2026/3/23

💬 オピニオンDeveloper Stack & InfrastructureIdeas & Deep AnalysisTools & Practical Usage

共有:

要点

著者は約100万トークンの非常に長いコンテキストウィンドウを検証するため、約80万トークンの org-mode ファイルと約10万通のメールを llama.cpp 経由でセルフホスト型の LLM に入力して検証している。
プリフィル時間は90秒から60分の範囲で、PP（prompt-per-second）レートはおおよそ4700トークン/秒から220トークン/秒、生成トークン速度はモデルにより90トークン/秒から24トークン/秒の範囲だった。
結果は混在しており、ファイルに関する事実的な質問はしばしば誤っているか歪んでおり、より一般的な質問はデタラメな情報を含む出力や使えない出力を生み、似たイベントの混同が頻繁に起こる。
著者は、この用途において --temp（温度設定）が関連するかどうか、1Mトークンのコンテキストを入力して完全なRAGパイプラインを迂回することが実現可能かどうかを疑問視している。
長いコンテキストを持つLLMがなぜ失敗するのか、そして大規模なファイルと maildir ベースのデータセットを透明かつ運用可能にするためのより良いツールが何かを求めている。

I am trying to query and extract information from a large, semi-structured org-mode file (with hierarchical entries and cross links) of about 800000 tokens length (depending on LLM, file size is about 2.5MB). This is basically a notes file spanning about 10 years of practical information of various kind, and definitively way too long to remember what's all inside. The file cross-references also elements of a maildir directory with ca 100000 mails.

I tried to directly feed that org-mode file into self-hosted LLMs by passing a "--ctx-size 0" (= native 1048576 tokens context window), and that works with:

Qwen3-Coder-30B-A3B-Instruct-1M-GGUF BF16
nvidia_Llama-3.1-8B-UltraLong-4M-Instruct-GGUF BF16
Meta/Llama-4-Scout-17B-16E-Instruct-GGUF/UD-Q4_K_XL
NVIDIA-Nemotron-3-Nano-30B-A3B/UD-Q5_K_XL and UD-Q8_K_XL
NVIDIA-Nemotron-3-Super-120B-A12B-GGUF UD-IQ4_XS / UD-Q5_K_S / UD-Q8_K_XL / BF16

I use llama.cpp.

Prefill takes between 90s and 60m (PP between 4700 t/s and 220 t/s), depending on size of the LLM, and token generation after uploading the org-mode file is between 90 and 24 t/s.

Hardware is a Zen5 32-core Threadripper Pro with 512GB of ECC RAM and dual RTX5090.

Yet, — results are mixed, at best. If I simply ask for factual information I do know is in the file, it is frequently answered wrong or distorted, and more general questions result in BS or at least in something totally unusable. A frequent pattern of failure in the answers is confusing and conflating similar events that are noted in the file.

This is a totally different experience than simply chatting with those same models without the enormous 1m token context window, and then the models are actually very good.

Is "--temp" a relevant setting for this use case?

The idea to throw the file directly at a 1M token context model originated as a means to avoid the complexities of a full RAG pipeline.

Why do those LLMs fail with very long contexts and what would be a better tool to make this info (file and maildir) transparent and operable?

submitted by /u/phwlarxoc
[link] [comments]