Is brute-forcing a 1M token context window the right approach?

Reddit r/LocalLLaMA / 3/23/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author tests very long context windows (about 1M tokens) by feeding an ~800k-token org-mode file and ~100k emails into self-hosted LLMs via llama.cpp.
  • Prefill times range from 90 seconds to 60 minutes, with prompt-processing (PP) rates from about 4,700 t/s down to 220 t/s, and token generation speeds from 90 down to 24 t/s depending on the model.
  • Results are mixed: factual questions about the file are often wrong or distorted, and more general questions can yield BS or unusable outputs, with frequent conflation of similar events.
  • The author questions whether --temp (temperature) is relevant for this use case and whether bypassing a full RAG pipeline by feeding a 1M-token context is viable.
  • They seek explanations for why long-context LLMs fail and what better tools exist to make a large file and a maildir-based dataset transparent and operable.

I am trying to query and extract information from a large, semi-structured org-mode file (hierarchical entries with cross-links) of about 800,000 tokens (depending on the LLM's tokenizer; the file is about 2.5 MB). It is basically a notes file spanning about 10 years of practical information of various kinds, and definitely far too long to remember everything that's in it. The file also cross-references entries in a maildir directory with roughly 100,000 mails.

I tried feeding that org-mode file directly into self-hosted LLMs by passing "--ctx-size 0" (= the model's native context window, here 1,048,576 tokens), and that works with:

  • Qwen3-Coder-30B-A3B-Instruct-1M-GGUF BF16
  • nvidia_Llama-3.1-8B-UltraLong-4M-Instruct-GGUF BF16
  • Meta/Llama-4-Scout-17B-16E-Instruct-GGUF/UD-Q4_K_XL
  • NVIDIA-Nemotron-3-Nano-30B-A3B/UD-Q5_K_XL and UD-Q8_K_XL
  • NVIDIA-Nemotron-3-Super-120B-A12B-GGUF UD-IQ4_XS / UD-Q5_K_S / UD-Q8_K_XL / BF16

I use llama.cpp.
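For reference, the sort of invocation described above might look like the following; the model filename is illustrative, and all flag values other than --ctx-size 0 are assumptions:

```shell
# --ctx-size 0 = use the model's native context length (here 1,048,576 tokens);
# -ngl 99 offloads as many layers as the GPUs will hold (assumed value);
# -f feeds the ~800k-token org-mode file as the prompt
llama-cli \
  -m Qwen3-Coder-30B-A3B-Instruct-1M-BF16.gguf \
  --ctx-size 0 -ngl 99 \
  -f notes.org
```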

Prefill takes between 90 s and 60 min (prompt processing between 4,700 t/s and 220 t/s), depending on the size of the LLM, and token generation after ingesting the org-mode file runs between 90 and 24 t/s.
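Taking the reported rates at face value, the prefill wall-clock time is just the prompt token count divided by the PP rate; at the slow end, the ~60 minutes follows directly:

```shell
# ~800k prompt tokens at the slowest reported prefill rate (220 t/s)
secs=$((800000 / 220))
echo "${secs} s = $((secs / 60)) min"   # 3636 s = 60 min
```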

Hardware is a Zen5 32-core Threadripper Pro with 512GB of ECC RAM and dual RTX5090.

Yet results are mixed, at best. If I simply ask for factual information that I know is in the file, the answer is frequently wrong or distorted, and more general questions produce BS or at least something totally unusable. A frequent failure pattern in the answers is confusing and conflating similar events that are noted in the file.

This is a totally different experience from simply chatting with the same models without the enormous 1M-token context window; there, the models are actually very good.

Is "--temp" a relevant setting for this use case?
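As a hedged note, temperature controls randomness at generation time, not how reliably the model recalls facts from a long context; for factual lookup, near-greedy sampling is the usual starting point. The values below are assumptions, not something from the post:

```shell
# near-greedy sampling for factual extraction; this reduces output randomness
# but does not fix long-context recall errors (model path is a placeholder)
llama-cli -m model.gguf --ctx-size 0 -f notes.org --temp 0 --top-k 1
```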

The idea of throwing the file directly at a 1M-token context model originated as a way to avoid the complexities of a full RAG pipeline.
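A middle ground between the full-context approach and a complete RAG pipeline is a cheap lexical pre-filter: since org-mode is line-oriented, the top-level subtrees whose heading matches a query term can be extracted with awk, and only that much smaller excerpt handed to the model. A rough sketch, where the file name and query term are placeholders:

```shell
# print only top-level org subtrees ("* " headings) whose heading line matches the query;
# "keep" is toggled at each top-level heading and gates every line until the next one,
# so sub-headings ("** ", "*** ", ...) and body text stay with their subtree
awk -v q="invoice" '/^\* /{keep=($0 ~ q)} keep' notes.org > excerpt.org
```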

Why do these LLMs fail with very long contexts, and what would be a better tool for making this information (the file and the maildir) transparent and operable?

submitted by /u/phwlarxoc