Omnicoder-9b SLAPS in Opencode

Reddit r/LocalLLaMA / 3/13/2026

💬 OpinionDeveloper Stack & InfrastructureTools & Practical UsageModels & Research

共有:

Key Points

OmniCoder-9B-GGUF running on an 8GB VRAM system with Q4_k_m GGUF via Opencode reportedly delivers around 40 tps and fast prompt throughput, with a concrete command and config shared to reproduce.
The post frames open-source models as a cost-effective alternative to quotas and price hikes from proprietary services, citing Copilot and Google's premium pricing.
A potential bug causing full prompt reprocessing is noted, and the author provides a reproduction setup including an Opencode config and a llama-server command to help others test.
The author observes that higher-context variants like q5_ks with 64k context can maintain similar speeds, and suggests MOEs might be better but slower depending on trade-offs.

I was feeling a bit disheartened by seeing how anti-gravity and github copilot were now putting heavy quota restrictions and I kinda felt internally threatened that this was the start of the enshitification and price hikes. Google is expecting you to pay $250 or you will only be taste testing their premium models.

I have 8gb vram, so I usually can't run any capable open source models for agentic coding at good speeds, I was messing with qwen3.5-9b and today I saw a post of a heavy finetune of qwen3.5-9b on Opus traces and I just was just gonna try it then cry about shitty performance and speeds but holyshit...

https://huggingface.co/Tesslate/OmniCoder-9B

I ran Q4_km gguf with ik_llama at 100k context and then set it up with opencode to test it and it just completed my test tasks flawlessly and it was fast as fuck, I was getting like 40tps plus and pp speeds weren't bad either.

I ran it with this

ik_llama.cpp\build\bin\Release\llama-server.exe -m models/Tesslate/OmniCoder-9B-GGUF/omnicoder-9b-q4_k_m.gguf -ngl 999 -fa 1 -b 2048 -ub 512 -t 8 -c 100000 -ctk f16 -ctv q4_0 --temp 0.4 --top-p 0.95 --top-k 20 --presence-penalty 0.0 --jinja --ctx-checkpoints 0

I am getting insane speed and performance. You can even go for q5_ks with 64000 context for the same speeds.

Although, there is probably a bug that causes full prompt reprocessing which I am trying to figure out how to fix.

this is my opencode config that I used for this:

 "local": { "models": { "/models/Tesslate/OmniCoder-9B-GGUF/omnicoder-9b-q4_k_m.gguf": { "interleaved": { "field": "reasoning_content" }, "limit": { "context": 100000, "output": 32000 }, "name": "omnicoder-9b-q4_k_m", "reasoning": true, "temperature": true, "tool_call": true } }, "npm": "@ai-sdk/openai-compatible", "options": { "baseURL": "http://localhost:8080/v1" } }, Anyone struggling with 8gb vram should try this. MOEs might be better but the speeds suck asssssss.

submitted by /u/True_Requirement_891
[link] [comments]