Is brute-forcing a 1M token context window the right approach?

Reddit r/LocalLLaMA / 3/23/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author tests very long context windows (about 1M tokens) by feeding an ~800k-token org-mode file and ~100k emails into self-hosted LLMs via llama.cpp.
  • Prefill times range from 90 seconds to 60 minutes, with prompt-processing (PP) rates from about 4,700 t/s down to 220 t/s, and token generation speeds from 90 down to 24 t/s depending on the model.
  • Results are mixed: factual questions about the file are often wrong or distorted, and more general questions can yield BS or unusable outputs, with frequent conflation of similar events.
  • The author questions whether --temp (temperature) is relevant for this use case and whether bypassing a full RAG pipeline by feeding a 1M-token context is viable.
  • They seek explanations for why long-context LLMs fail and what better tools exist to make a large file and a maildir-based dataset transparent and operable.

I am trying to query and extract information from a large, semi-structured org-mode file (hierarchical entries with cross-links) of about 800,000 tokens (depending on the LLM's tokenizer; the file is about 2.5 MB). It is basically a notes file spanning about 10 years of practical information of various kinds, and definitely far too long to remember everything that's in it. The file also cross-references entries in a maildir directory with roughly 100,000 mails.

I tried feeding that org-mode file directly into self-hosted LLMs by passing "--ctx-size 0" (= the model's native context window, here 1,048,576 tokens), and that works with:

  • Qwen3-Coder-30B-A3B-Instruct-1M-GGUF BF16
  • nvidia_Llama-3.1-8B-UltraLong-4M-Instruct-GGUF BF16
  • Meta/Llama-4-Scout-17B-16E-Instruct-GGUF/UD-Q4_K_XL
  • NVIDIA-Nemotron-3-Nano-30B-A3B/UD-Q5_K_XL and UD-Q8_K_XL
  • NVIDIA-Nemotron-3-Super-120B-A12B-GGUF UD-IQ4_XS / UD-Q5_K_S / UD-Q8_K_XL / BF16

I use llama.cpp.
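For reference, the sort of invocation described above might look like the following; the model filename is illustrative, and all flag values other than --ctx-size 0 are assumptions:

```shell
# --ctx-size 0 = use the model's native context length (here 1,048,576 tokens);
# -ngl 99 offloads as many layers as the GPUs will hold (assumed value);
# -f feeds the ~800k-token org-mode file as the prompt
llama-cli \
  -m Qwen3-Coder-30B-A3B-Instruct-1M-BF16.gguf \
  --ctx-size 0 -ngl 99 \
  -f notes.org
```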

Prefill takes between 90 s and 60 min (prompt processing between 4,700 t/s and 220 t/s), depending on the size of the LLM, and token generation after ingesting the org-mode file runs between 90 and 24 t/s.
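Taking the reported rates at face value, the prefill wall-clock time is just the prompt token count divided by the PP rate; at the slow end, the ~60 minutes follows directly:

```shell
# ~800k prompt tokens at the slowest reported prefill rate (220 t/s)
secs=$((800000 / 220))
echo "${secs} s = $((secs / 60)) min"   # 3636 s = 60 min
```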

Hardware is a Zen5 32-core Threadripper Pro with 512GB of ECC RAM and dual RTX5090.

Yet results are mixed, at best. If I simply ask for factual information that I know is in the file, the answer is frequently wrong or distorted, and more general questions produce BS or at least something totally unusable. A frequent failure pattern in the answers is confusing and conflating similar events that are noted in the file.

This is a totally different experience from simply chatting with the same models without the enormous 1M-token context window; there, the models are actually very good.

Is "--temp" a relevant setting for this use case?
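As a hedged note, temperature controls randomness at generation time, not how reliably the model recalls facts from a long context; for factual lookup, near-greedy sampling is the usual starting point. The values below are assumptions, not something from the post:

```shell
# near-greedy sampling for factual extraction; this reduces output randomness
# but does not fix long-context recall errors (model path is a placeholder)
llama-cli -m model.gguf --ctx-size 0 -f notes.org --temp 0 --top-k 1
```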

The idea of throwing the file directly at a 1M-token context model originated as a way to avoid the complexities of a full RAG pipeline.
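A middle ground between the full-context approach and a complete RAG pipeline is a cheap lexical pre-filter: since org-mode is line-oriented, the top-level subtrees whose heading matches a query term can be extracted with awk, and only that much smaller excerpt handed to the model. A rough sketch, where the file name and query term are placeholders:

```shell
# print only top-level org subtrees ("* " headings) whose heading line matches the query;
# "keep" is toggled at each top-level heading and gates every line until the next one,
# so sub-headings ("** ", "*** ", ...) and body text stay with their subtree
awk -v q="invoice" '/^\* /{keep=($0 ~ q)} keep' notes.org > excerpt.org
```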

Why do these LLMs fail with very long contexts, and what would be a better tool for making this information (the file and the maildir) transparent and operable?

submitted by /u/phwlarxoc