Reducing AI agent token consumption by 90% by fixing the retrieval layer

Reddit r/artificial / 3/27/2026


Key Points

  • The article argues that many AI agents waste tokens by retrieving large numbers of documents (e.g., ~200) via simple similarity search and stuffing them into prompts (~50,000 tokens), which often leads to retries and higher cost.

Quick insight from building retrieval infrastructure for AI agents:

Most agents stuff 50,000 tokens of context into every prompt. They retrieve 200 documents by cosine similarity, hope the right answer is somewhere in there, and let the LLM figure it out. When it doesn't, and it often doesn't, the agent re-retrieves. Every retry burns more tokens and money.
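To make the failure mode concrete, here is a minimal sketch of the naive pattern described above: embed the query, score every chunk by cosine similarity, and stuff the top 200 into the prompt. The names (`embed`, `naive_retrieve`) and the ~250-tokens-per-chunk figure are illustrative assumptions, not anyone's actual API.

```python
# Sketch of the "retrieve 200, hope for the best" pattern. All names and
# numbers here are illustrative stand-ins, not a real retrieval API.
import numpy as np

rng = np.random.default_rng(0)
DOC_EMBEDDINGS = rng.normal(size=(1000, 64))  # stand-in corpus of 1,000 chunks

def embed(text: str) -> np.ndarray:
    # stand-in embedder; a real system would call an embedding model
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=64)

def naive_retrieve(query: str, k: int = 200) -> list[int]:
    q = embed(query)
    # cosine similarity of the query against every document
    sims = DOC_EMBEDDINGS @ q / (
        np.linalg.norm(DOC_EMBEDDINGS, axis=1) * np.linalg.norm(q)
    )
    return np.argsort(-sims)[:k].tolist()

ids = naive_retrieve("how do I rotate an API key?")
# at roughly 250 tokens per chunk, 200 chunks is ~50,000 tokens of context
print(len(ids), "chunks,", len(ids) * 250, "estimated tokens")
```

Every one of those ~50,000 tokens is billed on every attempt, which is why retry loops compound the cost so quickly.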

We built a retrieval engine called Shaped that gives agents 10 ranked results instead of 200. The results are scored by ML models trained on actual interaction data, not just embedding similarity. In production, this means ~2,500 tokens per query instead of 50,000. The agent gets it right the first time, so no retry loops.
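The shape of that two-stage approach can be sketched as: a broad candidate pass by similarity, then a learned ranker that blends similarity with interaction signals and keeps only the top 10. The linear scorer and feature set below are hypothetical stand-ins; Shaped's actual models are not described in the post.

```python
# Toy two-stage retrieval: 200 candidates in, 10 ranked results out.
# The features and linear scorer are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: int
    similarity: float  # embedding similarity from the candidate pass
    click_rate: float  # behavioral signal from interaction logs
    tokens: int        # chunk length in tokens

def rank_score(c: Candidate, w=(0.4, 0.6)) -> float:
    # blend semantic similarity with interaction data instead of
    # trusting cosine similarity alone
    return w[0] * c.similarity + w[1] * c.click_rate

def rerank(candidates: list[Candidate], top_k: int = 10) -> list[Candidate]:
    return sorted(candidates, key=rank_score, reverse=True)[:top_k]

candidates = [
    Candidate(doc_id=i, similarity=1 - i / 200,
              click_rate=(i * 37 % 100) / 100, tokens=250)
    for i in range(200)
]
top = rerank(candidates)
print(len(top), "results,", sum(c.tokens for c in top), "tokens in the prompt")
```

The token math follows directly: 10 chunks of ~250 tokens is ~2,500 tokens per query instead of ~50,000 for the 200-chunk version.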

The most interesting part: the ranking model retrains on agent feedback automatically. When a user rephrases a question or the agent has to re-retrieve, that signal trains the model. The model on day 100 is measurably better than it was on day 1, with no manual intervention.
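One way to picture that feedback loop is an online update in which a re-retrieval becomes an implicit negative label and a first-try answer an implicit positive. The perceptron-style update below is a toy illustration of the idea, not Shaped's training pipeline.

```python
# Toy online feedback loop: re-retrievals are implicit negatives, first-try
# answers are implicit positives. Illustrative only, not a real pipeline.
weights = [0.4, 0.6]  # similarity weight, interaction weight
LR = 0.05

def score(features: list[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))

def log_feedback(features: list[float], answered_first_try: bool) -> None:
    # A re-retrieval means the served result was a miss: label it 0.
    label = 1.0 if answered_first_try else 0.0
    err = label - score(features)
    for i, f in enumerate(features):
        weights[i] += LR * err * f  # nudge weights toward the observed outcome

# a miss: high similarity, low interaction signal, agent had to re-retrieve
log_feedback([0.9, 0.2], answered_first_try=False)
# a hit: interaction-heavy result answered on the first try
log_feedback([0.5, 0.8], answered_first_try=True)
print(weights)
```

Run continuously, updates like these shift weight away from features that correlate with retries, which is how a model can improve from day 1 to day 100 without anyone relabeling data.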

We also shipped an MCP server so it works natively with Cursor, Claude Code, Windsurf, VS Code Copilot, Gemini, and OpenAI.

If anyone's working on agent retrieval quality, I'd love to hear what approaches you've tried.

Wrote up the full technical approach here: https://www.shaped.ai/blog/your-agents-retrieval-is-broken-heres-what-we-built-to-fix-it

submitted by /u/skeltzyboiii
