Many of us found ourselves in a shared state of panic this week when we realized we might suddenly lose access to Claude Code for agentic coding purposes. It's become a critical part of developer workflows, and productivity is now gated by rate limits, pricing tiers, and the ability for people at one corporation to reliably purchase GPUs from people at another corporation.
I'm not excited by the future that implies, and I think it's shitty for developer productivity to be constrained like this in general. I also don't want to be surprised by changes to critical workflows / tooling, and I'm sure you don't either. So I wanted to share a drop-in replacement for Sonnet workflows within Claude Code that leverages Qwen3.6-27B-FP8 (released earlier this week and outperforming Sonnet on many benchmarks).
In a few commands, you can have:
- No rate limits
- No token-based pricing
- Guaranteed consistent quality over time (no sneaky quantization at peak hours)
- Claude Code without Anthropic models
- Cost of ~$1.66 per hour (but NO interruptions to coding, and you can manage spend on-demand; quick math below)
Or, to visualize the endstate: https://imgur.com/a/EWQkEL8
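For scale: ~$1.66/hour × 8 hours is roughly $13 for a full coding day, and × 40 hours is about $66 for a week of heavy use, provided you actually delete the instance when you stop.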
Doing this involves a critical assumption: YOU as a developer are better at finding and sourcing your own GPU than any large company is, and you can do it cheaper. This is largely true, and services like Voltage Park, Google Cloud, AWS, and a bunch of others will gladly give you direct GPU access at hourly rates instead of tiered monthly pricing.
For this guide, we'll be using Thunder Compute, who advertise "the World's Cheapest GPUs" but also have a CLI tool for management and agent-friendly documentation. I had never heard of this service before this week and have no affiliation with them, but it goes to show how many players there are in this space to choose from.
Quickstart Guide
Open up two terminals. This guide assumes you have SSH and thunder-cli installed locally and some credits in your account.
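A quick sanity check before anything starts billing (assuming the thunder-cli binary is installed as "tnr", as it is throughout this guide):
command -v ssh tnr; tnr status;
Both binaries should resolve to a path, and the status call should come back with no running instances if you're starting fresh.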
Terminal 1: Remote inference
You will need to provision a remote machine to host and run your model.
NOTE: This costs money for every second the machine is running, so the clock starts now and runs until you tear things down at the end of the guide. You can stop at any time by running "tnr delete 0" to delete the first instance in the list (use "tnr status" to get the instance ID if you have more than one instance running).
tnr create --mode prototyping --gpu h100 --vcpus 8 --template base --disk-size-gb 100 --ephemeral-disk 200; tnr status;
Once the status command shows that the machine is "RUNNING", hit ctrl+c to exit.
Next, we need to connect to the remote machine, but also give our local Claude Code a way to talk to it. We do that with SSH tunneling, which securely forwards ports between the two machines. Think of it just like running Ollama or llama.cpp locally, except the tunnel bridges your machine and someone else's machine with a fancy GPU.
tnr connect 0 -t 8000;
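If you're curious, the -t 8000 flag is just standard SSH local port forwarding under the hood; the hand-rolled equivalent over plain SSH would look roughly like this (user and hostname are placeholders):
ssh -L 8000:localhost:8000 thunder@<remote_machine>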
In your console, you should now be logged into a remote machine as "thunder@<remote_machine>". Awesome, now copy and paste this command block, sit back, and wait for your model to load (may take 5-10 minutes and is your biggest cold-start hit):
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm==0.19.1
sudo /sbin/ldconfig
export HF_HUB_CACHE=/ephemeral
hf download Qwen/Qwen3.6-27B-FP8
vllm serve Qwen/Qwen3.6-27B-FP8 \
    --served-model-name qwen3.6-27b \
    --trust-remote-code \
    --language-model-only \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-prefix-caching \
    --gdn-prefill-backend triton \
    --host 0.0.0.0 \
    --port 8000
Once vLLM is ready, it should display a message saying it is running on port 8000. Once you see that, you can set up your Claude Code terminal. You can leave vLLM running until you're done (hit ctrl+c to stop it anytime).
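Before switching terminals, you can also confirm the tunnel works end-to-end from your LOCAL machine: vLLM exposes an OpenAI-compatible model listing, and the served model name should show up as qwen3.6-27b:
curl http://localhost:8000/v1/models;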
Terminal 2: Claude Code
You're probably going to want to log out of your Anthropic account before doing all this, just to be on the safe side. Do that with "claude /logout".
Next, you'll need to point Claude Code at your vLLM instance. These variables only last for the current terminal session, so they go away when you close the terminal.
export ANTHROPIC_BASE_URL="http://localhost:8000"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=117965
export ANTHROPIC_API_KEY="not-a-real-key"
export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.6-27b"
export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.6-27b"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.6-27b"
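Two notes on the values above: CLAUDE_CODE_AUTO_COMPACT_WINDOW=117965 is roughly 90% of the 131072-token --max-model-len from the vLLM command, which I take to be headroom so Claude Code compacts the conversation before it overruns the model's context. And if you don't want to retype the exports every session, one option is to paste the block into a file (~/qwen-env.sh is just an example name) and load it each time:
source ~/qwen-env.sh;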
Finally, run Claude Code and get hacking:
claude --model qwen3.6-27b;
This gives you a dedicated, FAST model that plays nice with existing agentic coding workflows. You can tweak the model, instance config, context windows, compaction, and other settings to fit your needs.
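If you want a quick smoke test first, Claude Code's print mode (-p) runs a single prompt non-interactively, which is an easy way to confirm requests are actually hitting your vLLM box:
claude --model qwen3.6-27b -p "write a one-line hello world in python";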
When you're done, close Claude Code, close vLLM, exit the thunder@ SSH session, then tear down your instance:
tnr status; tnr delete <instance_id>;
Start the process over again / refill credits as needed. It's a different set of trade-offs, but for those not blessed with local VRAM this may be a good middle ground for agentic workflows as the "value" of remote inference tiers continues to shift.
Other approaches that I tried so you don't have to
- llama.cpp: They don't distribute pre-built CUDA binaries, so you'll have to compile from source (rough build sketch after this list). Doable, but annoying. Unsloth would likely shred if you need more TPM and can sacrifice quality.
- ollama: Thunder Compute even has a template for this, but version skew means this week's Qwen3.6 release wasn't supported yet. You'd also need to compile from source if you want newer model support.
- Apache Ray: Really the only viable competitor to vLLM in terms of production inference serving capabilities / performance. Good for special use cases and customization for specific models / REALLY granular performance tuning in transformers.
- Non-ephemeral storage for serving: It's networked and too slow; vLLM takes forever to load the checkpoint shards. Don't bother for large models.
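If you do want to go down the llama.cpp road anyway, a CUDA build from source looks roughly like this (build flags drift between releases, so double-check their README):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j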


