How to connect Claude Code CLI to a local llama.cpp server

Reddit r/LocalLLaMA / 3/31/2026


Key Points

  • The guide shows how to point the Claude Code CLI to a locally running llama.cpp server by setting environment variables such as ANTHROPIC_BASE_URL in your shell configuration (.bashrc/.zshrc).
  • It provides example commands for running Claude Code with a specific model name argument (e.g., Qwen3.5-35B-Thinking) once the base URL is set.
  • For VS Code users, it explains how to configure Claude Code extension environment variables in settings.json so the extension can route requests to the local server.
  • The article notes that model names must exactly match those configured in llama-server.ini, and that the local server setup can support dynamic model switching through the preconfigured model list.
  • It adds troubleshooting guidance: local CLI performance may suffer due to context length, so users may want lower-context models (e.g., Haiku) and set additional Claude Code env vars like CLAUDE_CODE_DISABLE_1M_CONTEXT and CLAUDE_CODE_MAX_OUTPUT_TOKENS.

How to connect Claude Code CLI to a local llama.cpp server

I’ve seen a lot of people struggling to get Claude Code working with a local llama.cpp setup, so here’s a quick guide that worked for me.


1. CLI (Terminal)

Add this to your .bashrc (or .zshrc):

```bash
export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_BASE_URL="http://<your-llama.cpp-server>:8080"
```

Reload your shell:

```bash
source ~/.bashrc
```

and run the CLI with the model argument:

```bash
claude --model Qwen3.5-35B-Thinking
```
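Before launching, it can help to confirm the server is actually reachable from your machine. llama.cpp's llama-server exposes a `/health` endpoint; a quick sketch (substitute your real host for the placeholder):

```shell
# Sanity-check the llama.cpp server before pointing Claude Code at it.
# Replace <your-llama.cpp-server> with your actual server address.
curl -s http://<your-llama.cpp-server>:8080/health
```

If this doesn't return a healthy status, fix the server or firewall first; Claude Code's own errors will be much less informative.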


2. VS Code setup with the Claude Code extension installed

Edit:

$HOME/.config/Code/User/settings.json

Add:

```json
"claudeCode.environmentVariables": [
  { "name": "ANTHROPIC_BASE_URL", "value": "http://<your-llama.cpp-server>:8080" },
  { "name": "ANTHROPIC_AUTH_TOKEN", "value": "dummy" },
  { "name": "ANTHROPIC_API_KEY", "value": "sk-no-key-required" },
  { "name": "ANTHROPIC_MODEL", "value": "gpt-oss-20b" },
  { "name": "ANTHROPIC_DEFAULT_SONNET_MODEL", "value": "Qwen3.5-35B-Thinking-Coding" },
  { "name": "ANTHROPIC_DEFAULT_OPUS_MODEL", "value": "Qwen3.5-27B-Thinking-Coding" },
  { "name": "ANTHROPIC_DEFAULT_HAIKU_MODEL", "value": "gpt-oss-20b" },
  { "name": "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC", "value": "1" }
],
"claudeCode.disableLoginPrompt": true
```


Notes

  • This setup lets you use llama.cpp’s server (or llama-swap) to dynamically switch models by selecting one of the preconfigured ones in VS Code.
  • Make sure the model names you define here exactly match what you configured in your llama-server.ini.
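One way to verify that the names line up: llama-server (and llama-swap, which proxies it) serves an OpenAI-compatible `/v1/models` endpoint listing the model IDs it knows about. A sketch, with the host as a placeholder:

```shell
# List the model IDs the local server advertises. These are the names
# Claude Code must be given via --model or the ANTHROPIC_*_MODEL variables.
curl -s http://<your-llama.cpp-server>:8080/v1/models
```

Any mismatch between these IDs and the values in settings.json will surface as request errors in Claude Code.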

Edit: The CLI actually did not perform that well in my local tests, and to be honest I personally prefer other CLIs. But after u/Robos_Basilisk asked how this plays with context length, that might have been the reason.

So you most probably want to use a model with a shorter context length (like the Haiku model), or additionally set the env vars CLAUDE_CODE_DISABLE_1M_CONTEXT and CLAUDE_CODE_MAX_OUTPUT_TOKENS.
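A minimal sketch of setting those two variables in .bashrc — the specific values here are illustrative assumptions, not recommendations from the docs:

```shell
# Opt out of the 1M-token context mode and cap output tokens.
# 8192 is an illustrative value; tune it to your model's context window.
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192
```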

For the list of supported env vars consult: https://code.claude.com/docs/en/env-vars

Edit: u/truthputer pointed out that you most probably also want to set the undocumented env var CLAUDE_CODE_ATTRIBUTION_HEADER to "0".

submitted by /u/StrikeOner