Bring coding AI
back to your own servers.
Running a commercial-grade coding AI in-house has, until now, meant racking up several H100s. Cohere's newly released North Mini Code 1.0 is a 30B-scale model that runs on just a single H100—and it's Apache 2.0. The break-even point for self-hosting has quietly shifted.
The bar for
"running it in-house" was too high
Trying to run a commercial-grade coding AI on your own servers has, until now, meant a realistic floor of eight or more H100s just to reach GPT-4-class capability. Managed services like GitHub Copilot, Cursor, and Claude Code are easy to adopt—but in exchange, your code and internal data go out to the cloud.
For companies in healthcare, finance, and defense that cannot let data leave their walls, or for teams where Cursor's per-seat monthly fees pile up, the bind persisted: "we want to use it but can't / using it hurts the budget." An option that satisfied capability, cost, and data sovereignty all at once barely existed.
| Managed service | North Mini Code (self-hosted) |
|---|---|
| Code and internal data go to the cloud | Everything stays in-house |
| Seats × monthly fees keep adding up | Low running cost once you own the GPU |
| 8+ H100s was the floor | Runs on a single H100 |
| License follows the service contract | Apache 2.0 open weights |
Pull coding AI
back from the cloud to your own hands.
Why a 30B model
runs on one H100
The key is MoE (Mixture-of-Experts). The total is 30B, but only a small slice runs at any one time.
Cohere has open-sourced "North Mini Code 1.0," its first dedicated model for developers, under Apache 2.0. It's a 30B-scale Mixture-of-Experts coding model in which only about 3B parameters are activated per token—it holds a huge body, but only some of its "experts" run at once. That's why you can run it on a single NVIDIA H100 without renting an entire data center.
Distribution is geared for real-world use, too. The weights are published on Hugging Face (CohereLabs/North-Mini-Code-1.0), and an fp8 build that saves even more memory is also offered. To just try it, there's a free trial on OpenCode; for production, deployment via vLLM main is supported.
Who does it help?
The value of "running it yourself" is greatest for teams that cannot let data leave their walls.
On-prem in regulated industries
Healthcare, finance, defense, and others that can't send their own data to the cloud. Because everything stays inside the corporate network, they can use generative AI while keeping data sovereignty.
Teams stung by billing
Keep paying Cursor's monthly fee per team member and costs swell as usage grows. If you can own the GPU, it's a setup that makes running costs easier to contain.
Slotting into existing infra
Since it supports vLLM main, teams that already run an inference stack can try it relatively quickly. There's no need to build a whole new stack from scratch.
What grew was "choice"
One caveat: independent verification of performance is still limited. Whether it stands alongside GPT-4o or Claude Sonnet is best judged not by benchmark figures but by measuring it on your own use case. If you can use cloud APIs without issue, the effort of switching to self-hosting may not be worth it.
Even so, this is real progress. A commercial-grade coding AI you can place in-house on a single H100, under Apache 2.0—that move now sits in front of teams that previously had no option. It's more accurate to see it as a third path added to what used to be a binary choice: "managed, or nothing."