background: I write mostly in Korean and my Claude API bill is kind of embarrassing. Korean tokenizes really inefficiently compared to English for the same meaning, so a chunk of the cost is basically just encoding overhead.
the idea is a small proxy in Bun that sits in front of the Claude API. Claude Code talks to localhost and doesn't know anything changed. before each request goes out, Gemma4 E2B (llama.cpp, local) would do the following (rough sketch of the proxy after the list):
- translate the Korean input to English. the response still comes back in Korean; only the outbound prompt is English
- trim context that's probably not relevant to the current turn
- for requests that look like they need reasoning, have Gemma4 do the thinking first and pass the result along, so the paid model hopefully skips some of that work and spends fewer reasoning tokens
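
roughly what I have in mind for the proxy plus the translation step. this assumes llama.cpp is running as `llama-server` on localhost:8080 with its OpenAI-compatible chat endpoint; the port, the system prompt, and "only translate the latest user turn" are placeholder choices, nothing here is tested:

```ts
// rough sketch only: a Bun proxy that rewrites the outbound prompt before forwarding
// to the Anthropic API. assumes llama.cpp is running as `llama-server` on localhost:8080
// (its OpenAI-compatible chat endpoint). ports, prompts, and "translate only the latest
// user turn" are placeholder choices, not tested.

const LLAMA_URL = "http://localhost:8080/v1/chat/completions";
const ANTHROPIC_URL = "https://api.anthropic.com/v1/messages";

// ask the local model to translate Korean to English; fall back to the original on any failure
async function toEnglish(text: string): Promise<string> {
  try {
    const res = await fetch(LLAMA_URL, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        temperature: 0,
        messages: [
          { role: "system", content: "Translate the user's message from Korean to English. Output only the translation." },
          { role: "user", content: text },
        ],
      }),
    });
    const data = await res.json();
    return data.choices?.[0]?.message?.content ?? text;
  } catch {
    return text; // local model down or slow: pass the original through instead of blocking
  }
}

Bun.serve({
  port: 4100,
  async fetch(req) {
    const url = new URL(req.url);
    if (url.pathname !== "/v1/messages") {
      return new Response("not handled", { status: 404 });
    }

    const body = await req.json();

    // translate only the newest user turn for now; Claude Code can also send structured
    // content blocks, which this sketch ignores (it only handles plain-string content)
    const last = body.messages?.at(-1);
    if (last?.role === "user" && typeof last.content === "string") {
      last.content = await toEnglish(last.content);
    }

    // copy the auth-related headers through and forward to the real API
    const headers: Record<string, string> = { "content-type": "application/json" };
    for (const name of ["x-api-key", "authorization", "anthropic-version", "anthropic-beta"]) {
      const v = req.headers.get(name);
      if (v) headers[name] = v;
    }

    return fetch(ANTHROPIC_URL, { method: "POST", headers, body: JSON.stringify(body) });
  },
});
```

Claude Code would then get pointed at http://localhost:4100; I believe setting ANTHROPIC_BASE_URL is enough for that, but I haven't verified it on my setup. context trimming and the pre-reasoning step would hook in the same place as the translation call.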
planning to cache with SQLite in WAL mode to avoid read/write contention on every request.
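
the cache layer I'm picturing with Bun's built-in bun:sqlite, keyed on a hash of the Korean source text. the key scheme and table layout are just one option, not a settled design:

```ts
// minimal sketch of the cache layer using bun:sqlite.
// keyed on a sha256 of the Korean source text; schema is a placeholder.
import { Database } from "bun:sqlite";

const db = new Database("cache.sqlite");
db.run("PRAGMA journal_mode = WAL;"); // readers don't block the writer and vice versa
db.run(`
  CREATE TABLE IF NOT EXISTS translations (
    key TEXT PRIMARY KEY,    -- hash of the Korean source text
    english TEXT NOT NULL,
    created_at INTEGER NOT NULL
  );
`);

function cacheKey(korean: string): string {
  return new Bun.CryptoHasher("sha256").update(korean).digest("hex");
}

export function getCached(korean: string): string | null {
  const row = db
    .query("SELECT english FROM translations WHERE key = ?")
    .get(cacheKey(korean)) as { english: string } | null;
  return row?.english ?? null;
}

export function putCached(korean: string, english: string): void {
  db.query("INSERT OR REPLACE INTO translations (key, english, created_at) VALUES (?, ?, ?)")
    .run(cacheKey(korean), english, Date.now());
}
```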
one thing I'm genuinely unsure about before I start building: does pre-supplying reasoning actually save anything, or does the model just redo it internally anyway and charge you for it regardless?
the bigger concern is speed. the whole point breaks down if Gemma4 adds more in latency than it saves in cost. has anyone actually run Gemma4 E2B on an Intel Mac? curious what kind of tokens/sec you're getting with llama.cpp on that hardware specifically. Apple Silicon numbers are everywhere, but Intel is harder to find.
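
in the meantime I'll probably just time a single round-trip myself before committing to the whole design, something like this (reuses the toEnglish helper from the proxy sketch; the sample prompt is arbitrary):

```ts
// crude latency check: one translation round-trip against the local llama-server,
// using the toEnglish() helper from the proxy sketch above. run it a few times and
// weigh the added milliseconds against the token savings on the paid request.
const sample = "여기에 대표적인 한국어 프롬프트를 넣는다"; // any representative Korean prompt
const t0 = performance.now();
const english = await toEnglish(sample);
console.log(`translated in ${Math.round(performance.now() - t0)} ms`);
console.log(english);
```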




