Bring coding AI
back to your own servers.

Running a commercial-grade coding AI in-house has, until now, meant racking up several H100s. Cohere's newly released North Mini Code 1.0 is a 30B-scale model that runs on just a single H100—and it's Apache 2.0. The break-even point for self-hosting has quietly shifted.

AI Navigate Editorial·2026.06.10·6 min read

The Wall

The bar for
"running it in-house" was too high

Trying to run a commercial-grade coding AI on your own servers has, until now, meant a realistic floor of eight or more H100s just to reach GPT-4-class capability. Managed services like GitHub Copilot, Cursor, and Claude Code are easy to adopt—but in exchange, your code and internal data go out to the cloud.

For companies in healthcare, finance, and defense that cannot let data leave their walls, or for teams where Cursor's per-seat monthly fees pile up, the bind persisted: "we want to use it but can't / using it hurts the budget." An option that satisfied capability, cost, and data sovereignty all at once barely existed.

Managed service	North Mini Code (self-hosted)
Code and internal data go to the cloud	Everything stays in-house
Seats × monthly fees keep adding up	Low running cost once you own the GPU
8+ H100s was the floor	Runs on a single H100
License follows the service contract	Apache 2.0 open weights

Pull coding AI
back from the cloud to your own hands.

How It Fits

Why a 30B model
runs on one H100

The key is MoE (Mixture-of-Experts). The total is 30B, but only a small slice runs at any one time.

FIG. It holds a 30B body, yet only ~3B runs per token—so it fits on a single H100.

30B

Total parameters (MoE)

~3B

Activated per token only

1×H100

GPU required

Cohere has open-sourced "North Mini Code 1.0," its first dedicated model for developers, under Apache 2.0. It's a 30B-scale Mixture-of-Experts coding model in which only about 3B parameters are activated per token—it holds a huge body, but only some of its "experts" run at once. That's why you can run it on a single NVIDIA H100 without renting an entire data center.

Distribution is geared for real-world use, too. The weights are published on Hugging Face (CohereLabs/North-Mini-Code-1.0), and an fp8 build that saves even more memory is also offered. To just try it, there's a free trial on OpenCode; for production, deployment via vLLM main is supported.

In Practice

Who does it help?

The value of "running it yourself" is greatest for teams that cannot let data leave their walls.

On-prem in regulated industries

Healthcare, finance, defense, and others that can't send their own data to the cloud. Because everything stays inside the corporate network, they can use generative AI while keeping data sovereignty.

Teams stung by billing

Keep paying Cursor's monthly fee per team member and costs swell as usage grows. If you can own the GPU, it's a setup that makes running costs easier to contain.

Slotting into existing infra

Since it supports vLLM main, teams that already run an inference stack can try it relatively quickly. There's no need to build a whole new stack from scratch.

Frontier

What grew was "choice"

One caveat: independent verification of performance is still limited. Whether it stands alongside GPT-4o or Claude Sonnet is best judged not by benchmark figures but by measuring it on your own use case. If you can use cloud APIs without issue, the effort of switching to self-hosting may not be worth it.

Even so, this is real progress. A commercial-grade coding AI you can place in-house on a single H100, under Apache 2.0—that move now sits in front of teams that previously had no option. It's more accurate to see it as a third path added to what used to be a binary choice: "managed, or nothing."

The bar for"running it in-house" was too high

Why a 30B modelruns on one H100