LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more

Reddit r/LocalLLaMA / 5/21/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical Usage

共有:

Key Points

LlamaStation v0.9 is a Windows GUI for llama.cpp that runs llama-server.exe directly as a subprocess, exposing full control over every command-line flag without intermediate abstractions.
The app adds multiple switchable backends (official llama.cpp with MTP, TurboQuant, AtomicChat, and an experimental BeeLlama), aiming to preserve llama.cpp performance while improving usability.
It includes real-time per-GPU VRAM monitoring, per-model saved profiles, an offline voice mode (XTTS v2 cloning plus faster-whisper recognition), and a headless mode for automation.
Users can benefit from an auto-updater that updates the underlying llama.cpp build and can check for AtomicChat releases from within the app.
The author reports strong practical gains on long-context workloads (e.g., Qwen3.6 27B) using TurboQuant KV cache with MTP, and invites community feedback and contributions, especially for Linux/Mac support and additional backend integrations.

I've been building this for the past few months as a side project — started because I didn't want to run llama.cpp from the command line every time I wanted to try a model. I just wanted something that worked with a click.
Fair warning: I'm not a developer. This is 100% vibe coded with AI assistance. If something in the codebase makes you cringe, please be kind and open a PR instead 🙏
Most frontends either hide everything behind abstractions (Ollama, LM Studio) or leave you writing command lines manually. LlamaStation tries to sit in the middle: a clean UI with full access to every parameter.
What makes it different
Runs llama-server directly — no intermediate layer, no daemon, no abstraction. LlamaStation launches llama-server.exe as a subprocess with full control over every flag. What you configure is exactly what gets passed to the binary. This means you get the full performance of llama.cpp with none of the overhead that tools like Ollama add on top.
Multiple backends, switchable from the UI:

⚡ Official llama.cpp (with MTP support since PR #22673)
🔬 TurboQuant fork — asymmetric KV cache quantization. This is the killer feature for me: 200k+ context on 24GB VRAM (dual RTX 3060) with minimal quality loss
⚛️ AtomicChat — TurboQuant + MTP combined
🐝 BeeLlama — DFlash + TurboQuant (experimental)

Real-time VRAM meter per GPU — color coded, updates live as the model loads.
Per-model profiles — every setting remembered automatically per model file.
Voice mode — push-to-talk or always-listening, voice cloning via XTTS v2, speech recognition via faster-whisper. Fully offline.
Headless mode — run without GUI using saved profiles, for servers or automation.
Auto-updater — updates llama.cpp official (and checks AtomicChat releases) from inside the app.

My setup for context
Dual RTX 3060 (24GB total), Ryzen 7 5700X, 32GB DDR4 3600MHz, Windows 11. Running Qwen3.6 27B Q4_K_M with TurboQuant KV cache and MTP — 177k context. Without MTP the same model starts at ~17 tok/s and drops to ~10 on long responses. With MTP it starts at ~29 tok/s and holds at ~22 even on long code generation. This is what I built LlamaStation for.

Status
v0.9 — it works well for my daily use. I've fully replaced other tools with it — I use it as the backend for coding agents, Telegram bots, voice assistants and other local automations. There's one known bug (server watchdog gets stuck in "restarting" state after OOM crash) and probably others I haven't hit yet. Opening it up to get feedback and contributions.
Not a programmer by trade — built this entirely with AI assistance. The codebase is a single main file by design, easy to read and modify.
Contributions very welcome — especially:

Linux/Mac port (currently Windows only)
Bug fixes
New backend integrations
UI improvements

GitHub — MIT license, no telemetry, no accounts.

- u/Responsible_Egg9736

submitted by /u/pmttyji
[link] [comments]

Black Hat USA

AI Business

Web devs sleeping with the enemy: AI is doing their job and they worry it's after their desk too

The Register

Revolutionizing Hotel Front Desk with AI

Dev.to

Apple Silicon as a Serious AI Dev Box: What an M4 Max Actually Does With a 70B Model

Dev.to

Plagio con IA: ChatGPT copió su tutorial con todo y los enlaces internos

Dev.to

LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more

Key Points

Related Articles

Black Hat USA

Web devs sleeping with the enemy: AI is doing their job and they worry it's after their desk too

Revolutionizing Hotel Front Desk with AI

Apple Silicon as a Serious AI Dev Box: What an M4 Max Actually Does With a 70B Model

Plagio con IA: ChatGPT copió su tutorial con todo y los enlaces internos

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer