How I cut my OpenClaw costs in half (Lumin)

Dev.to / 4/7/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The article explains that OpenClaw-style agentic loops become expensive mainly due to repeatedly sending the same large system prompts, tool/workspace context, and loop structures on every turn.
  • It describes Lumin, a self-hosted, OpenAI-compatible local proxy that reduces cost by applying compression, caching with freshness-guard logic, and model routing before requests reach model providers.
  • The biggest savings came from static context compression and improved handling of repeated context across multi-turn loops, where trimming duplicated prompt material compounds over time.
  • It highlights a new TOON-based compression layer for structured JSON arrays that avoids repeating field names for uniform tabular-style data.
  • Freshness and pivot detection help prevent overly aggressive cache reuse when the conversation or task meaningfully shifts to new data, reducing the risk of stale context.

I've been spending a lot of time working with agentic workflows, especially OpenClaw-style loops where the model reuses big chunks of context every turn.
That gets expensive very fast.

Most of the cost came from repeated context, oversized prompts, and the same structure being sent back and forth on every turn. You're often not paying for new reasoning; you're paying for the same prompt, tool context, and loop structure again and again.
This led me to build Lumin, a local proxy that sits in front of model calls and reduces cost before the request ever reaches the provider.
GitHub: https://github.com/ryancloto-dot/Lumin

The problem:
Most agentic loops look like this:

Large system prompts sent every turn
Repeated workspace and tool context
Loops re-sending almost identical request structures
Agents paying full price even when most of the prompt hasn't changed
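To see why this dominates the bill, here is a back-of-the-envelope sketch. All numbers are hypothetical and chosen for illustration, not taken from the article's benchmarks:

```python
# Rough cost sketch for a multi-turn agent loop (all numbers hypothetical).
SYSTEM_PROMPT_TOKENS = 8_000   # static system prompt + tool/workspace context
NEW_TOKENS_PER_TURN = 500      # genuinely new content each turn
TURNS = 20
PRICE_PER_1K_INPUT = 0.003     # illustrative $/1K input tokens

total_input = TURNS * (SYSTEM_PROMPT_TOKENS + NEW_TOKENS_PER_TURN)
duplicate = (TURNS - 1) * SYSTEM_PROMPT_TOKENS  # resent, unchanged context

print(f"total input tokens: {total_input:,}")                # 170,000
print(f"duplicated tokens:  {duplicate:,}")                  # 152,000
print(f"share duplicated:   {duplicate / total_input:.0%}")  # 89%
print(f"input cost: ${total_input / 1000 * PRICE_PER_1K_INPUT:.2f}")
```

Under these toy numbers, nearly nine tokens out of ten are context the model has already seen.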

How Lumin works
Lumin sits between your agent and the model provider:
Your agent → Lumin → OpenAI / Anthropic / Google / Ollama / OpenRouter
                │
                ├─ compression
                ├─ cache + freshness guards
                ├─ model routing
                └─ live savings dashboard
It's self-hosted and exposes an OpenAI-compatible endpoint, so the integration is usually one environment variable.

What actually helped

  1. Static context compression
    A lot of agent calls include large prompts that barely change turn to turn. If the task is short but the prompt is enormous, there's usually room to trim repeated or low-value sections safely. This was one of the biggest wins on OpenClaw-style workflows.
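The article doesn't show Lumin's compression internals, so as a minimal sketch of the idea, here is a hypothetical pass that drops exact-duplicate paragraphs from a prompt, one cheap form of trimming repeated sections safely:

```python
import hashlib

def dedupe_paragraphs(prompt: str) -> str:
    """Drop paragraphs that repeat verbatim earlier in the prompt.

    A toy stand-in for static context compression: a real system would
    also strip low-value boilerplate and near-duplicates, not just
    exact repeats.
    """
    seen = set()
    kept = []
    for para in prompt.split("\n\n"):
        key = hashlib.sha256(para.strip().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(para)
    return "\n\n".join(kept)

prompt = "Tool docs: search(query)\n\nTask: fix the bug\n\nTool docs: search(query)"
print(dedupe_paragraphs(prompt))
# The second "Tool docs" block is dropped.
```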

  2. Repeated context handling

This is where the best numbers came from. If an agent keeps resending very similar context across multiple turns, the savings compound significantly compared to one-shot requests. That's why repeated-context loops showed much stronger results than simpler single-turn prompts.
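The cache internals aren't described in the article either; a minimal sketch of why repeated-context loops compound, assuming a hash-keyed prefix cache (a toy model, not Lumin's actual design):

```python
import hashlib

class PrefixCache:
    """Toy cache: count the tokens we avoid resending when the static
    prefix of a request is unchanged from a previous turn."""

    def __init__(self):
        self.seen = {}        # prefix hash -> token count
        self.tokens_saved = 0

    def submit(self, static_prefix: str, new_suffix: str) -> str:
        prefix_tokens = len(static_prefix.split())  # crude token estimate
        key = hashlib.sha256(static_prefix.encode()).hexdigest()
        if key in self.seen:
            self.tokens_saved += prefix_tokens
            return new_suffix                  # resend only the new part
        self.seen[key] = prefix_tokens
        return static_prefix + new_suffix      # first turn: full request

cache = PrefixCache()
prefix = "system prompt " * 100               # 200 words of static context
for turn in range(5):
    cache.submit(prefix, f"turn {turn}")
print(cache.tokens_saved)                     # 4 repeat turns x 200 = 800
```

A one-shot request hits the cache zero times; a five-turn loop hits it four times, which is why the savings scale with loop length.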

  3. TOON for structured data

I recently added a TOON-based compression layer for structured JSON arrays. Instead of repeating field names on every row, TOON declares them once in a more token-efficient format before sending to the model.
More on TOON here: https://github.com/toon-format/toon
This doesn't help every workflow, but it helps a lot when the input contains large uniform structured exports.
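To make the shape concrete, here is a simplified TOON-like encoder for a uniform array of objects. This follows the spirit of the format, declaring field names once as a header, but is not the full spec (see the TOON repo for the real grammar, including quoting and nesting):

```python
import json

def toon_like(key: str, rows: list[dict]) -> str:
    """Encode a uniform list of dicts with field names declared once.

    Simplified: assumes flat values containing no commas; the real
    TOON spec handles quoting, nesting, and alternate delimiters.
    """
    fields = list(rows[0])
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(r[f]) for f in fields) for r in rows]
    return "\n".join([header, *lines])

rows = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
print(toon_like("users", rows))
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user
print(len(json.dumps(rows)))  # plain JSON is longer for uniform rows
```

The win grows with row count: the field names are paid for once instead of once per row.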

  4. Freshness guards

I added freshness and pivot detection to back off cache reuse when a chain shifts to new data. If the task clearly pivots, Lumin avoids reusing stale compressed or cached context too aggressively. That part still has room to improve, but it mattered enough that I didn't want to optimize cost without it.
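The article doesn't specify how pivot detection works; one plausible sketch, assuming a simple word-overlap heuristic (a hypothetical stand-in, not Lumin's implementation):

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def should_reuse_cache(cached_context: str, new_turn: str,
                       threshold: float = 0.3) -> bool:
    """Back off cache reuse when the new turn barely overlaps the
    cached context -- a crude stand-in for pivot detection."""
    return jaccard(cached_context, new_turn) >= threshold

ctx = "refactor the payments module and add retry logic"
print(should_reuse_cache(ctx, "add retry logic to the payments module"))  # True
print(should_reuse_cache(ctx, "summarize quarterly sales figures"))       # False
```

A real guard would likely use embeddings or recency signals rather than raw word overlap, but the shape is the same: below some similarity threshold, treat the cache as stale.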

Results:
Results depend heavily on workload shape. On the benchmark set:

Average savings: ~11%
Repeated-context loops: up to 57%
Structured export workflows: up to 57.5%

The repeated-context number is the most relevant one for OpenClaw and NanoClaw setups, where the same context files get injected every turn.

How to integrate:
Option 1: Tell your agent to install it:
Install Lumin from github.com/ryancloto-dot/Lumin
and configure it for my setup

Option 2: One environment variable:
For OpenAI-compatible agents:
OPENAI_BASE_URL=http://localhost:8000/v1
For Anthropic-compatible routing:
ANTHROPIC_BASE_URL=http://localhost:8000/anthropic/main

What's next:

Better answer-quality evals
More freshness and cache invalidation work
Cleaner integration with existing agent shells
Broader workload benchmarking that's easier to reproduce publicly

Try it:
https://github.com/ryancloto-dot/Lumin
If you do try it, I'd especially like feedback on setup friction, whether the savings dashboard is clear, where it helps the most, and where it gets too aggressive and hurts answer quality.