Reducing AI agent token consumption by 90% by fixing the retrieval layer

Reddit r/artificial / 3/27/2026


Key Points

  • The article argues that many AI agents waste tokens by retrieving large numbers of documents (e.g., ~200) via simple similarity search and stuffing them into prompts (~50,000 tokens), which often leads to retries and higher cost.

Quick insight from building retrieval infrastructure for AI agents:

Most agents stuff 50,000 tokens of context into every prompt. They retrieve 200 documents by cosine similarity, hope the right answer is somewhere in there, and let the LLM figure it out. When it doesn't, and it often doesn't, the agent re-retrieves. Every retry burns more tokens and money.
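To make the failure mode concrete, here is a minimal sketch of the naive pattern described above: embed the query, score every chunk by cosine similarity, and stuff the top 200 into the prompt. The names (`embed`, `naive_retrieve`) and the ~250-tokens-per-chunk figure are illustrative assumptions, not anyone's actual API.

```python
# Sketch of the "retrieve 200, hope for the best" pattern. All names and
# numbers here are illustrative stand-ins, not a real retrieval API.
import numpy as np

rng = np.random.default_rng(0)
DOC_EMBEDDINGS = rng.normal(size=(1000, 64))  # stand-in corpus of 1,000 chunks

def embed(text: str) -> np.ndarray:
    # stand-in embedder; a real system would call an embedding model
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=64)

def naive_retrieve(query: str, k: int = 200) -> list[int]:
    q = embed(query)
    # cosine similarity of the query against every document
    sims = DOC_EMBEDDINGS @ q / (
        np.linalg.norm(DOC_EMBEDDINGS, axis=1) * np.linalg.norm(q)
    )
    return np.argsort(-sims)[:k].tolist()

ids = naive_retrieve("how do I rotate an API key?")
# at roughly 250 tokens per chunk, 200 chunks is ~50,000 tokens of context
print(len(ids), "chunks,", len(ids) * 250, "estimated tokens")
```

Every one of those ~50,000 tokens is billed on every attempt, which is why retry loops compound the cost so quickly.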

We built a retrieval engine called Shaped that gives agents 10 ranked results instead of 200. The results are scored by ML models trained on actual interaction data, not just embedding similarity. In production, this means ~2,500 tokens per query instead of 50,000. The agent gets it right the first time, so no retry loops.
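The shape of that two-stage approach can be sketched as: a broad candidate pass by similarity, then a learned ranker that blends similarity with interaction signals and keeps only the top 10. The linear scorer and feature set below are hypothetical stand-ins; Shaped's actual models are not described in the post.

```python
# Toy two-stage retrieval: 200 candidates in, 10 ranked results out.
# The features and linear scorer are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: int
    similarity: float  # embedding similarity from the candidate pass
    click_rate: float  # behavioral signal from interaction logs
    tokens: int        # chunk length in tokens

def rank_score(c: Candidate, w=(0.4, 0.6)) -> float:
    # blend semantic similarity with interaction data instead of
    # trusting cosine similarity alone
    return w[0] * c.similarity + w[1] * c.click_rate

def rerank(candidates: list[Candidate], top_k: int = 10) -> list[Candidate]:
    return sorted(candidates, key=rank_score, reverse=True)[:top_k]

candidates = [
    Candidate(doc_id=i, similarity=1 - i / 200,
              click_rate=(i * 37 % 100) / 100, tokens=250)
    for i in range(200)
]
top = rerank(candidates)
print(len(top), "results,", sum(c.tokens for c in top), "tokens in the prompt")
```

The token math follows directly: 10 chunks of ~250 tokens is ~2,500 tokens per query instead of ~50,000 for the 200-chunk version.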

The most interesting part: the ranking model retrains on agent feedback automatically. When a user rephrases a question or the agent has to re-retrieve, that signal trains the model. The model on day 100 is measurably better than it was on day 1, with no manual intervention.
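One way to picture that feedback loop is an online update in which a re-retrieval becomes an implicit negative label and a first-try answer an implicit positive. The perceptron-style update below is a toy illustration of the idea, not Shaped's training pipeline.

```python
# Toy online feedback loop: re-retrievals are implicit negatives, first-try
# answers are implicit positives. Illustrative only, not a real pipeline.
weights = [0.4, 0.6]  # similarity weight, interaction weight
LR = 0.05

def score(features: list[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))

def log_feedback(features: list[float], answered_first_try: bool) -> None:
    # A re-retrieval means the served result was a miss: label it 0.
    label = 1.0 if answered_first_try else 0.0
    err = label - score(features)
    for i, f in enumerate(features):
        weights[i] += LR * err * f  # nudge weights toward the observed outcome

# a miss: high similarity, low interaction signal, agent had to re-retrieve
log_feedback([0.9, 0.2], answered_first_try=False)
# a hit: interaction-heavy result answered on the first try
log_feedback([0.5, 0.8], answered_first_try=True)
print(weights)
```

Run continuously, updates like these shift weight away from features that correlate with retries, which is how a model can improve from day 1 to day 100 without anyone relabeling data.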

We also shipped an MCP server so it works natively with Cursor, Claude Code, Windsurf, VS Code Copilot, Gemini, and OpenAI.

If anyone's working on agent retrieval quality, I'd love to hear what approaches you've tried.

Wrote up the full technical approach here: https://www.shaped.ai/blog/your-agents-retrieval-is-broken-heres-what-we-built-to-fix-it

submitted by /u/skeltzyboiii
