Hey r/LocalLLaMA,
I am releasing my first model quantization: an 8-bit symmetric AWQ (W8A16) of kai-os/Carnice-9b, specifically optimized for Ampere GPUs (RTX 30-series) using vLLM with the Marlin kernel on a single-GPU inference setup.
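For anyone curious what "8-bit symmetric" means mechanically, here's a minimal numpy sketch of per-tensor symmetric int8 weight quantization. This is a simplification: the actual checkpoint uses AWQ's activation-aware, per-group scales, but the symmetric part (zero-point fixed at 0, only a scale stored) looks like this:

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: the zero-point is fixed
    at 0, so only one scale factor is stored with the int8 weights."""
    scale = float(np.abs(w).max()) / 127.0  # map largest magnitude to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # In a W8A16 kernel the int8 weights are scaled back up to 16-bit
    # precision on the fly; activations stay in fp16/bf16 throughout.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_symmetric_int8(w)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # bounded by half the scale (rounding error)
```

W8A16 keeps activations in 16-bit, so only the weight tensors pay the rounding cost above; that's why quality loss at 8-bit is usually negligible.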
kai-os/Carnice-9b is a specialized fine-tune of Qwen/Qwen3.5-9B that removes the visual components and adopts the Qwen3_5ForCausalLM architecture for pure text/agentic use (Hermes Agent harness). This architecture is not yet natively supported by vLLM (pending PR #39316).
To enable seamless loading, the quantized checkpoint re-wraps the weights into the Qwen3_5ForConditionalGeneration architecture (matching the original Qwen/Qwen3.5-9B configuration). This allows vLLM to serve it correctly with the --language-model-only flag for text-only inference.
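Roughly what the re-wrap amounts to, as a sketch (not the actual conversion script; the real checkpoint also carries the quantization config and renamed weight prefixes). The `architectures` field follows the standard Hugging Face config.json layout, and the path here is illustrative:

```python
import json, os, tempfile

def rewrap_architecture(config_path: str) -> None:
    """Point the checkpoint's config.json at the architecture class
    vLLM already supports, leaving everything else untouched."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Swap the unsupported causal-LM class for the supported one
    cfg["architectures"] = ["Qwen3_5ForConditionalGeneration"]
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)

# Demo on a throwaway config file
tmp = os.path.join(tempfile.mkdtemp(), "config.json")
with open(tmp, "w") as f:
    json.dump({"architectures": ["Qwen3_5ForCausalLM"]}, f)
rewrap_architecture(tmp)
with open(tmp) as f:
    print(json.load(f)["architectures"])  # ['Qwen3_5ForConditionalGeneration']
```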
Model: https://huggingface.co/TurbulenceDeterministe/Carnice-9b-W8A16-AWQ
Benchmark highlights (vLLM bench on random dataset, single RTX 3090 + Marlin):
• Average prompt throughput: ~1,994 tokens/s
• Average generation throughput: ~222 tokens/s
Next I'm going to run benchmarks specific to the Hermes agent environment (Terminal Bench Lite and YC bench). From a quick vibe check it seems pretty good.
Quick vLLM usage (single GPU):
vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --language-model-only \
  --tensor-parallel-size 1

I would greatly appreciate your feedback on how to improve future quantizations. Thank you!




