Why Cost Optimization Matters
An LLM product's production cost becomes huge cumulatively. Even a few yen per request becomes millions to hundreds of millions of yen/month with users × frequency × 365 days. Being conscious of optimization can be 5-10x more efficient.
1. Prompt Caching
A mechanism cutting input-token cost up to 90% when reusing the same system prompt.
- Anthropic: explicit via cache_control parameter
- OpenAI: automatic caching (when the same prefix hits a certain number of times)
- Google: Context Caching
The effect is enormous in cases like agent operation or RAG where the system prompt and tool definitions are long.
⚠️ Cache-invalidation pitfall: mixing dynamic elements like datetime, username, session ID at the prompt head causes cache misses, applying normal new-token pricing. The fix is to design "static prefix → dynamic tail." In ProjectDiscovery's actual case, just moving working memory to the end of the message cut LLM cost by 59%.
2. Model Cascade
Selectively use multiple models in one app:
- Light (Mini, Nano, Haiku): classification, routing, simple extraction
- Mid (Sonnet, Mistral Large, Gemini Flash): daily work, summary, translation
- Frontier (GPT-5, Claude Opus 4.7): complex reasoning, code gen, agent commander



