Cut Claude usage by ~85% in a job search pipeline (16k → 900 tokens/app) — here’s what worked

Reddit r/artificial / 4/8/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article describes how a job search automation pipeline initially consumed about 16k tokens per application and became unsustainable under Claude usage limits.
  • By treating token efficiency as a first-class design constraint, the author reduced usage by roughly 85%, reaching about 900 tokens per application with improved stability.
  • The biggest improvement came from prompt caching, specifically caching repeated system and profile context (with cache_control: ephemeral), which reduced repeated-operation token spend.
  • It also improved cost/performance by routing tasks to different Claude models (Haiku/Sonnet/Opus) based on workload, rather than defaulting to the most expensive model.
  • Additional savings were achieved through precomputing reusable “answer bank” responses, deduplicating repeated work (e.g., semantic TF-IDF filtering), and adding a lightweight classifier to avoid unnecessary deep reasoning.

Like many here, I kept running into Claude usage limits when building anything non-trivial.

I was working with a job search automation pipeline (based on the Career-Ops project), and the naive flow was burning ~16k tokens per application — completely unsustainable.

So I spent some time reworking it with a focus on token efficiency as a first-class concern, not an afterthought.

🚀 Results

  • ~85% reduction in token usage
  • ~900 tokens per application
  • Most repeated context calls eliminated
  • Much more stable under usage limits

⚡ What actually helped (practical takeaways)

1. Prompt caching (biggest win)

  • Cached system + profile context (cache_control: ephemeral)
  • Break-even after 2 calls, strong gains after that
  • ~40% reduction on repeated operations

👉 If you're re-sending the same context every time, you're wasting tokens.
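A minimal sketch of what marking the static prefix cacheable can look like with the Anthropic Messages API: only the request payload is built here, and the profile text, system prompt, and model name are placeholders, not the post's actual values. Everything up to and including the block carrying `cache_control` is cached, so the large unchanging prefix is billed at the cheaper cached rate from the second call on.

```python
# Build a Messages API request whose static system + profile blocks are
# marked cacheable. PROFILE_CONTEXT and the model id are hypothetical.

PROFILE_CONTEXT = "Candidate profile: senior backend engineer, 8 yrs, Python/Go ..."

def build_request(job_posting: str) -> dict:
    """Return a request dict; the cache_control marker sets the cache boundary."""
    return {
        "model": "claude-3-5-haiku-latest",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You evaluate job postings against a fixed candidate profile."},
            {
                "type": "text",
                "text": PROFILE_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # everything up to here is cached
            },
        ],
        # Only this part changes per application:
        "messages": [{"role": "user", "content": job_posting}],
    }

request = build_request("Job: Backend Engineer at ExampleCorp ...")
# client.messages.create(**request)  # the actual call; requires the anthropic SDK and an API key
```

The key point is that the per-application text goes *after* the cache boundary; reordering it into the cached prefix would invalidate the cache on every call.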

2. Model routing instead of defaulting to Sonnet/Opus

  • Lightweight tasks → Haiku
  • Medium reasoning → Sonnet
  • Heavy tasks only → Opus

👉 Most steps don’t need expensive models.

3. Precompute anything reusable

  • Built an answer bank (25 standard responses) in one call
  • Reused across applications

👉 Eliminated ~94% of LLM calls during form filling.

4. Avoid duplicate work

  • TF-IDF semantic dedup (threshold 0.82)
  • Filters duplicate job listings before evaluation

👉 Prevents burning tokens on the same content repeatedly.
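The dedup step can be sketched with a small, dependency-free TF-IDF + cosine filter; the author's implementation likely differs (e.g., a library vectorizer), but the shape is the same: vectorize, compare against already-kept listings, drop anything above 0.82.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple TF-IDF: term frequency times a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(listings, threshold=0.82):
    """Keep a listing only if no already-kept listing is too similar."""
    vecs = tfidf_vectors(listings)
    kept_idx = []
    for i in range(len(listings)):
        if all(cosine(vecs[i], vecs[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [listings[i] for i in kept_idx]
```

Because the filter runs before any model call, a dropped duplicate costs zero tokens.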

5. Reduce “over-intelligence”

  • Added a lightweight classifier step before heavy reasoning
  • Only escalate to deeper models when needed

👉 Not everything needs full LLM reasoning.
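The gate in front of heavy reasoning can be a plain heuristic classifier; the keyword sets below are illustrative assumptions, not the post's actual criteria:

```python
# Cheap pre-filter: decide whether a listing even warrants an expensive
# model call. MUST_HAVE and DEAL_BREAKERS are hypothetical keyword sets.

MUST_HAVE = {"python", "backend"}
DEAL_BREAKERS = {"unpaid", "onsite-only"}

def needs_deep_reasoning(listing: str) -> bool:
    """Return True only for listings that pass the cheap relevance screen."""
    words = set(listing.lower().split())
    if words & DEAL_BREAKERS:
        return False                # reject outright, zero LLM tokens spent
    return bool(words & MUST_HAVE)  # escalate only plausible matches
```

A real version might use a small embedding model or a single Haiku call instead of keywords, but the structure is the same: a cheap yes/no in front of the expensive path.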

🧠 Key insight

Most Claude workflows hit limits not because they're complex, but because they recompute everything every time.

🧩 Curious about others’ setups

  • How are you handling repeated context?
  • Anyone using caching aggressively in multi-step pipelines?
  • Any good patterns for balancing Haiku vs Sonnet vs Opus?

https://github.com/maddykws/jubilant-waddle

Inspired by Santiago Fernández’s Career-Ops — this is a fork focused on efficiency + scaling under usage limits.

submitted by /u/distanceidiot