How to connect Claude Code CLI to a local llama.cpp server

Reddit r/LocalLLaMA / 3/31/2026


Key Points

  • The guide shows how to point the Claude Code CLI to a locally running llama.cpp server by setting environment variables such as ANTHROPIC_BASE_URL in your shell configuration (.bashrc/.zshrc).
  • It provides example commands for running Claude Code with a specific model name argument (e.g., Qwen3.5-35B-Thinking) once the base URL is set.
  • For VS Code users, it explains how to configure Claude Code extension environment variables in settings.json so the extension can route requests to the local server.
  • The article notes that model names must exactly match those configured in llama-server.ini, and that the local server setup can support dynamic model switching through the preconfigured model list.
  • It adds troubleshooting guidance: local CLI performance may suffer due to context length, so users may want lower-context models (e.g., Haiku) and set additional Claude Code env vars like CLAUDE_CODE_DISABLE_1M_CONTEXT and CLAUDE_CODE_MAX_OUTPUT_TOKENS.

How to connect Claude Code CLI to a local llama.cpp server

I’ve seen a lot of people struggling to get Claude Code working with a local llama.cpp setup, so here’s a quick guide that worked for me.


1. CLI (Terminal)

Add this to your .bashrc (or .zshrc):

```bash
export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_BASE_URL="http://<your-llama.cpp-server>:8080"
```

Reload your shell:

```bash
source ~/.bashrc
```

and run the CLI with the model argument:

```bash
claude --model Qwen3.5-35B-Thinking
```
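Before launching, it can help to confirm the server is actually reachable from your machine. llama.cpp's llama-server exposes a `/health` endpoint; a quick sketch (substitute your real host for the placeholder):

```shell
# Sanity-check the llama.cpp server before pointing Claude Code at it.
# Replace <your-llama.cpp-server> with your actual server address.
curl -s http://<your-llama.cpp-server>:8080/health
```

If this doesn't return a healthy status, fix the server or firewall first; Claude Code's own errors will be much less informative.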


2. VS Code setup with the Claude Code extension installed

Edit:

$HOME/.config/Code/User/settings.json

Add:

```json
"claudeCode.environmentVariables": [
  { "name": "ANTHROPIC_BASE_URL", "value": "http://<your-llama.cpp-server>:8080" },
  { "name": "ANTHROPIC_AUTH_TOKEN", "value": "dummy" },
  { "name": "ANTHROPIC_API_KEY", "value": "sk-no-key-required" },
  { "name": "ANTHROPIC_MODEL", "value": "gpt-oss-20b" },
  { "name": "ANTHROPIC_DEFAULT_SONNET_MODEL", "value": "Qwen3.5-35B-Thinking-Coding" },
  { "name": "ANTHROPIC_DEFAULT_OPUS_MODEL", "value": "Qwen3.5-27B-Thinking-Coding" },
  { "name": "ANTHROPIC_DEFAULT_HAIKU_MODEL", "value": "gpt-oss-20b" },
  { "name": "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC", "value": "1" }
],
"claudeCode.disableLoginPrompt": true
```


Notes

  • This setup lets you use llama.cpp’s server (or llama-swap) to dynamically switch models by selecting one of the preconfigured ones in VS Code.
  • Make sure the model names you define here exactly match what you configured in your llama-server.ini.
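One way to verify that the names line up: llama-server (and llama-swap, which proxies it) serves an OpenAI-compatible `/v1/models` endpoint listing the model IDs it knows about. A sketch, with the host as a placeholder:

```shell
# List the model IDs the local server advertises. These are the names
# Claude Code must be given via --model or the ANTHROPIC_*_MODEL variables.
curl -s http://<your-llama.cpp-server>:8080/v1/models
```

Any mismatch between these IDs and the values in settings.json will surface as request errors in Claude Code.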

Edit: The CLI actually did not perform that well in my local tests, and to be honest I personally prefer other CLIs. But after u/Robos_Basilisk asked how this plays with context length, that might have been the reason.

So you most probably want to use a model with a shorter context length (like the Haiku model), or additionally set the env vars CLAUDE_CODE_DISABLE_1M_CONTEXT and CLAUDE_CODE_MAX_OUTPUT_TOKENS.
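A minimal sketch of setting those two variables in .bashrc — the specific values here are illustrative assumptions, not recommendations from the docs:

```shell
# Opt out of the 1M-token context mode and cap output tokens.
# 8192 is an illustrative value; tune it to your model's context window.
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192
```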

For the list of supported env vars consult: https://code.claude.com/docs/en/env-vars

Edit: u/truthputer pointed out that you most probably also want to set the undocumented env var CLAUDE_CODE_ATTRIBUTION_HEADER to "0".

submitted by /u/StrikeOner