Cut Claude usage by ~85% in a job search pipeline (16k → 900 tokens/app) — here’s what worked

Reddit r/artificial / 4/8/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article describes how a job search automation pipeline initially consumed about 16k tokens per application and became unsustainable under Claude usage limits.
  • By treating token efficiency as a first-class design constraint, the author reduced usage by roughly 85%, reaching about 900 tokens per application with improved stability.
  • The biggest improvement came from prompt caching, specifically caching repeated system and profile context (with cache_control: ephemeral), which reduced repeated-operation token spend.
  • It also improved cost/performance by routing tasks to different Claude models (Haiku/Sonnet/Opus) based on workload, rather than defaulting to the most expensive model.
  • Additional savings were achieved through precomputing reusable “answer bank” responses, deduplicating repeated work (e.g., semantic TF-IDF filtering), and adding a lightweight classifier to avoid unnecessary deep reasoning.

Like many here, I kept running into Claude usage limits when building anything non-trivial.

I was working with a job search automation pipeline (based on the Career-Ops project), and the naive flow was burning ~16k tokens per application — completely unsustainable.

So I spent some time reworking it with a focus on token efficiency as a first-class concern, not an afterthought.

🚀 Results

  • ~85% reduction in token usage
  • ~900 tokens per application
  • Most repeated context calls eliminated
  • Much more stable under usage limits

⚡ What actually helped (practical takeaways)

1. Prompt caching (biggest win)

  • Cached system + profile context (cache_control: ephemeral)
  • Break-even after 2 calls, strong gains after that
  • ~40% reduction on repeated operations

👉 If you're re-sending the same context every time, you're wasting tokens.
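A minimal sketch of what marking the static prefix cacheable can look like with the Anthropic Messages API: only the request payload is built here, and the profile text, system prompt, and model name are placeholders, not the post's actual values. Everything up to and including the block carrying `cache_control` is cached, so the large unchanging prefix is billed at the cheaper cached rate from the second call on.

```python
# Build a Messages API request whose static system + profile blocks are
# marked cacheable. PROFILE_CONTEXT and the model id are hypothetical.

PROFILE_CONTEXT = "Candidate profile: senior backend engineer, 8 yrs, Python/Go ..."

def build_request(job_posting: str) -> dict:
    """Return a request dict; the cache_control marker sets the cache boundary."""
    return {
        "model": "claude-3-5-haiku-latest",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You evaluate job postings against a fixed candidate profile."},
            {
                "type": "text",
                "text": PROFILE_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # everything up to here is cached
            },
        ],
        # Only this part changes per application:
        "messages": [{"role": "user", "content": job_posting}],
    }

request = build_request("Job: Backend Engineer at ExampleCorp ...")
# client.messages.create(**request)  # the actual call; requires the anthropic SDK and an API key
```

The key point is that the per-application text goes *after* the cache boundary; reordering it into the cached prefix would invalidate the cache on every call.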

2. Model routing instead of defaulting to Sonnet/Opus

  • Lightweight tasks → Haiku
  • Medium reasoning → Sonnet
  • Heavy tasks only → Opus

👉 Most steps don’t need expensive models.

3. Precompute anything reusable

  • Built an answer bank (25 standard responses) in one call
  • Reused across applications

👉 Eliminated ~94% of LLM calls during form filling.

4. Avoid duplicate work

  • TF-IDF semantic dedup (threshold 0.82)
  • Filters duplicate job listings before evaluation

👉 Prevents burning tokens on the same content repeatedly.
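The dedup step can be sketched with a small, dependency-free TF-IDF + cosine filter; the author's implementation likely differs (e.g., a library vectorizer), but the shape is the same: vectorize, compare against already-kept listings, drop anything above 0.82.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple TF-IDF: term frequency times a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(listings, threshold=0.82):
    """Keep a listing only if no already-kept listing is too similar."""
    vecs = tfidf_vectors(listings)
    kept_idx = []
    for i in range(len(listings)):
        if all(cosine(vecs[i], vecs[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [listings[i] for i in kept_idx]
```

Because the filter runs before any model call, a dropped duplicate costs zero tokens.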

5. Reduce “over-intelligence”

  • Added a lightweight classifier step before heavy reasoning
  • Only escalate to deeper models when needed

👉 Not everything needs full LLM reasoning.
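The gate in front of heavy reasoning can be a plain heuristic classifier; the keyword sets below are illustrative assumptions, not the post's actual criteria:

```python
# Cheap pre-filter: decide whether a listing even warrants an expensive
# model call. MUST_HAVE and DEAL_BREAKERS are hypothetical keyword sets.

MUST_HAVE = {"python", "backend"}
DEAL_BREAKERS = {"unpaid", "onsite-only"}

def needs_deep_reasoning(listing: str) -> bool:
    """Return True only for listings that pass the cheap relevance screen."""
    words = set(listing.lower().split())
    if words & DEAL_BREAKERS:
        return False                # reject outright, zero LLM tokens spent
    return bool(words & MUST_HAVE)  # escalate only plausible matches
```

A real version might use a small embedding model or a single Haiku call instead of keywords, but the structure is the same: a cheap yes/no in front of the expensive path.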

🧠 Key insight

Most Claude workflows hit limits not because they're complex, but because they recompute everything every time.

🧩 Curious about others’ setups

  • How are you handling repeated context?
  • Anyone using caching aggressively in multi-step pipelines?
  • Any good patterns for balancing Haiku vs Sonnet vs Opus?

https://github.com/maddykws/jubilant-waddle

Inspired by Santiago Fernández’s Career-Ops — this is a fork focused on efficiency + scaling under usage limits.

submitted by /u/distanceidiot